Abstract

We define the group lasso estimator for the natural parameters of the exponential families of
distributions representing hierarchical log-linear models under a multinomial sampling scheme.
This estimator arises as the unique solution of a convex penalized likelihood program using
the group lasso penalty. We illustrate how it is possible to construct, in a straightforward way,
an estimator of the underlying log-linear model based on the blocks of non-zero coefficients
recovered by the group lasso procedure. We investigate the asymptotic properties of the
group lasso estimator and of the associated model selection criterion in a double-asymptotic
framework, in which both the sample size and the model complexity grow simultaneously. We
provide conditions guaranteeing that the group lasso estimator is norm consistent and that the
group lasso model selection is a consistent procedure, in the sense that, with overwhelming
probability as the sample size increases, it will correctly identify all the sets of non-zero interactions
among the variables. Provided the sequence of true underlying models is sparse enough,
recovery is possible even if the number of cells grows larger than the sample size. Finally, we
derive some central limit type results for the log-linear group lasso estimator.

1 Introduction 

Log-linear model analysis of categorical data is a widespread and important set of statistical method- 
ologies that have found applications in very diverse scientific areas, ranging from social and bio- 
logical sciences, to medicine, disclosure limitation problems, data-mining, image analysis, finger- 
printing, language processing and genetics. Inherently, log-linear modeling is a model selection 
methodology for contingency tables that encompasses testing a number of statistical models for 
the joint distribution of a group of categorical variables. The classical asymptotic theory of model 
selection and goodness-of-fit testing is well developed and understood for the "small p and large N"
case. It is applicable to a variety of goodness-of-fit measures, such as Pearson's $\chi^2$, the likelihood
ratio statistic and, more generally, the power-divergence family of Cressie and Read (1988). The 
applicability and validity of these methods demand the availability of large sample sizes and the 
existence of the maximum likelihood estimate.

In recent years, the importance and usage of log-linear modeling methodologies have increased
dramatically with the compilation and diffusion of large databases in the form of sparse contingency 
tables. In such instances, the number of sampled units is not much different, in fact often smaller, 
than the number of cells, so that most of the cell entries are very small or zero counts. In high-dimensional
settings, the traditional methodologies indicated above are simply inadequate. First,
the number of log-linear models grows extremely fast with the number of variables (for example,
there are 7,580 hierarchical models for a 5-way table), and selecting an optimal model involves
exploring a prohibitively large space of models. Secondly, for a given model of even
moderate complexity, the MLE is unlikely to exist: the small information content in the data limits 
the possibility for inference only to a portion of the parameter space (see Rinaldo, 2006a, for 
details). As a result, traditional goodness-of-fit testing and model selection will produce very poor,
if not completely erroneous, asymptotic approximations. It is quite clear that a more appropriate
statistical formalization requires the consideration of a "large p" setting. 

In this article, we propose a methodology for log-linear model selection that is particularly 
suited to high-dimensional tables, and we describe some of its asymptotic properties. Our results 
are akin to the asymptotic optimality of the lasso estimator in high dimensional least squares problems,
where the recovery of the sparsity pattern of an unknown set of parameters in noisy settings
via $\ell_1$ regularization is possible, even if the number of parameters grows faster than the sample
size. See, in particular, Meinshausen and Bühlmann (2006), Zhao and Yu (2006), Wainwright
(2006) and, for a different approach, Greenshtein (2006) and Greenshtein and Ritov (2006).
Existing work on penalized likelihood problems involving $\ell_1$ regularization for discrete problems
includes the non-asymptotic analyses of estimation in high-dimensional generalized linear models
via the lasso by van de Geer (2006a,b), and the sufficient conditions of Wainwright et al. (2006) for
consistency of $\ell_1$-regularized logistic regression with binary variables under a double asymptotic
framework. In Section 5, we discuss in detail the differences between our problem and solutions
and the existing results.

We formulate the log-linear model selection problem as a convex penalized likelihood problem 
based on the group lasso, a convex penalty function introduced by Yuan and Lin (2006) in a non-asymptotic
ANOVA setting. The group lasso regularization is an extension of the lasso $\ell_1$ penalty
designed to penalize groups of coefficients simultaneously. It has been shown to be effective in
logistic regression problems by Meier et al. (2006) and has been used in applications involving 
log-linear modeling of sparse contingency tables in Dahinden et al. (2006). 

The paper is organized as follows. In Section 2 we describe the log-linear model settings we will
be considering. The direct sum decomposition of the natural parameter space by log-linear subspaces
defines a partition of the parameters in blocks of different dimensions, which are utilized
as arguments of the group penalty function. In Section 3 we describe the group lasso estimator for
log-linear models, which can be computed by solving a convex program and can be interpreted as
a smoothed MLE (see Section 3.1). Taking advantage of the combinatorial properties of log-linear
models, we show that the group lasso estimator produces, in turn, an estimator of the underlying
log-linear model, which is constructed simply by isolating the non-zero blocks of the group lasso
estimates. In Section 4, we study the consistency properties of the group lasso estimator and of
the associated model selection procedure. We formulate a rather general double-asymptotic framework
in which we allow both the sample size and the model complexity to grow. We break down
our analysis into different steps, each step establishing progressively stronger results and, accordingly,
requiring stronger assumptions than the previous one. In Section 4.2, we derive conditions
guaranteeing that the group lasso estimator is norm consistent. Our assumptions rely on an extension
of local approximations by quadratic mean differentiability of regular models to the double-asymptotic
settings we consider. In Section 4.3, we strengthen our assumptions to guarantee that
the model estimates are consistent, i.e. that, asymptotically, the group lasso procedure correctly
identifies the set of interactions making up the underlying model. We conclude our analysis with
some central limit results in Section 4.4, which prove, in particular, that the group lasso estimator
is asymptotically biased and inefficient. The proofs appear in Section 6 and in the Appendices.

2 Log-linear Models 

We adopt the usual log-linear modeling setting, which we formalize below. We consider $K$ categorical
random variables $X_1, \ldots, X_K$, each taking values on a finite set, which, without loss of
generality, can always be assumed to be $\mathcal{I}_k = \{1, \ldots, I_k\}$. Letting $\mathcal{I} = \bigotimes_{k=1}^{K} \mathcal{I}_k$, $\mathbb{R}^{\mathcal{I}}$ is the vector
space of $K$-dimensional arrays of format $I_1 \times \ldots \times I_K$, i.e. the vector space of real-valued functions
defined on $\mathcal{I}$. Each element of $\mathcal{I}$, a cell, is a multi-index $(i_1, \ldots, i_K)$, whose $k$-th coordinate indicates
the value taken on by the $k$-th variable. For convenience, we identify $\mathbb{R}^{\mathcal{I}}$ with the Euclidean
space $\mathbb{R}^{I}$, where $I = \prod_k I_k$, so that the standard inner product $\langle x, y \rangle = \sum_{i \in \mathcal{I}} x_i y_i$ is defined.
(This identification can be easily made by ordering $\mathcal{I}$ as a linear list using any bijection between
$\mathcal{I}$ and the set $\{1, 2, \ldots, I\}$.) Therefore, each cell can be represented by a single index $i$ between 1
and $I$, rather than by a multi-index.
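The identification of cells with a single linear index can be illustrated numerically. The following is a minimal sketch, assuming a hypothetical 2 x 3 x 4 table and NumPy's row-major ordering as the bijection; any other bijection would serve equally well, and the indices here are 0-based whereas the text uses 1-based labels.

```python
import numpy as np

# Hypothetical 3-way table with I_1 = 2, I_2 = 3, I_3 = 4 levels (illustrative sizes).
levels = (2, 3, 4)
num_cells = int(np.prod(levels))  # I = I_1 * I_2 * I_3

# One possible bijection between multi-indices and {0, ..., I - 1}:
# NumPy's row-major ordering.
multi_index = (1, 2, 3)
linear_index = int(np.ravel_multi_index(multi_index, levels))
recovered = tuple(int(v) for v in np.unravel_index(linear_index, levels))

# Once cells are ordered as a list, the inner product of two tables is the
# ordinary Euclidean inner product of the flattened arrays.
rng = np.random.default_rng(0)
x = rng.normal(size=levels)
y = rng.normal(size=levels)
inner_as_arrays = float(np.sum(x * y))
inner_as_vectors = float(x.ravel() @ y.ravel())
```

The two inner products agree because flattening only relabels the cells; it does not change which entries are paired.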

The cross-classification of $N$ independent and identically distributed realizations of $(X_1, \ldots, X_K)$
produces a random integer-valued vector $n \in \mathbb{R}^{\mathcal{I}}$, a contingency table, whose coordinate entry
$n_{i_1, \ldots, i_K}$ corresponds to the number of times the cell combination $(i_1, \ldots, i_K)$ was observed in the
sample. The table $n$ has a Multinomial$(N, \pi)$ distribution, where $\pi$ is a strictly positive probability
vector with coordinates

\[ \pi_{i_1, \ldots, i_K} = \mathbb{P}\left( (X_1, \ldots, X_K) = (i_1, \ldots, i_K) \right). \]

In log-linear modeling, the joint distribution of $(X_1, \ldots, X_K)$ is fully specified by representing
the cell mean vector $m = \mathbb{E}\, n = N\pi$ by means of certain linear subspaces $\mathcal{M}$ of $\mathbb{R}^{\mathcal{I}}$ containing $\log m$,
to the extent that log-linear models themselves are defined by such subspaces. Namely, by fixing
$\mathcal{M}$, it follows that the logarithms of the cell mean vectors must satisfy specific linear constraints,
to be specified below, which completely characterize the underlying distribution. The log-likelihood
function at a point $\mu \in \mathcal{M}$ is

\[ \ell(\mu) = \sum_i n_i \log \frac{m_i}{\langle m, 1 \rangle} + \log N! - \sum_i \log n_i!, \]

where $m = \exp^{\mu}$ and $1$ is the $I$-dimensional vector containing ones. Because of the Multinomial
sampling assumption, $\ell$ is only defined over the subset $\widetilde{\mathcal{M}} \subset \mathcal{M}$ given by

\[ \widetilde{\mathcal{M}} = \{ \mu \in \mathcal{M} : \langle m, 1 \rangle = N \}, \]

which is neither a vector space nor a convex set. Instead, it is convenient to explicitly discard the
subspace of $\mathcal{M}$ that is fixed by design and to work with the smaller linear subspace $\mathcal{M} \cap \mathcal{R}(1)^{\perp}$,
where $\mathcal{R}(1)$ is the one-dimensional subspace spanned by $1$. For each $\beta \in \mathcal{M} \cap \mathcal{R}(1)^{\perp}$, set

\[ \tilde{\ell}(\beta) = \langle n, \beta \rangle - N \log \langle \exp^{\beta}, 1 \rangle + \log N! - \sum_i \log n_i!. \]

Then, there exists a bijection between $\widetilde{\mathcal{M}}$ and $\mathcal{M} \cap \mathcal{R}(1)^{\perp}$ in the sense that, for each $\mu \in \widetilde{\mathcal{M}}$ there
exists one $\beta \in \mathcal{M} \cap \mathcal{R}(1)^{\perp}$ such that

\[ \exp^{\mu} = \frac{N}{\langle \exp^{\beta}, 1 \rangle} \exp^{\beta} \quad \text{and} \quad \tilde{\ell}(\beta) = \ell(\mu), \quad \text{for each } n, \]

and, conversely, for each $\beta \in \mathcal{M} \cap \mathcal{R}(1)^{\perp}$ there exists one $\mu \in \widetilde{\mathcal{M}}$ satisfying the above identities
(for a proof of this result in more generality, see Lemma 2.2 in Rinaldo, 2006b).

Therefore, if $U$ is any full-rank matrix whose columns span $\mathcal{M} \cap \mathcal{R}(1)^{\perp}$, then $\beta = U\theta$ for some
$\theta \in \mathbb{R}^{k}$, with $k = \dim(\mathcal{M}) - 1$, so that the log-likelihood function can be re-written as

\[ \ell(\theta) = \langle U^\top n, \theta \rangle - N \log \langle \exp^{U\theta}, 1 \rangle + \log N! - \sum_i \log n_i!, \qquad \theta \in \mathbb{R}^{k}. \tag{1} \]

This re-parametrization is essentially equivalent to reduction to minimal form of the underlying
exponential family of distributions for the cell counts via sufficiency. In fact, the previous display
shows that each log-linear model corresponds to a full, regular exponential family of dimension
$\dim(\mathcal{M}) - 1$ and natural sufficient statistic $U^\top n$.

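The gradient and Hessian matrix for $\ell(\theta)$ are easily derivable. Letting $b = \exp^{U\theta}$, one can see
that

\[ \nabla \ell(\theta) = U^\top \left( n - \frac{N}{\langle 1, b \rangle} b \right) = U^\top (n - m) \tag{2} \]

and

\[ \nabla^2 \ell(\theta) = -U^\top \left( D_m - \frac{m m^\top}{N} \right) U, \tag{3} \]

where $m = \frac{N}{\langle 1, b \rangle} b = \mathbb{E}_\theta n$ and $D_m$ denotes the diagonal matrix with diagonal $m$. It is worth pointing
out that, because these models are exponential families, the negative Hessian is the covariance
matrix of the natural sufficient statistics $U^\top n$ and also the Fisher information multiplied by the
sample size.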
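The gradient and Hessian formulas can be checked numerically. The sketch below assumes a toy 2 x 2 table and a +/-1 contrast coding for the design matrix; both are illustrative choices, not taken from the paper. It compares the analytic gradient with central finite differences and evaluates the Hessian in the form $-U^\top (D_m - m m^\top / N) U$.

```python
import numpy as np

def loglik_kernel(theta, U, n):
    """Log-likelihood (1) without the constant log N! - sum_i log n_i!."""
    N = n.sum()
    return float(n @ (U @ theta) - N * np.log(np.exp(U @ theta).sum()))

def gradient(theta, U, n):
    """U^T (n - m), with m = N b / <1, b> and b = exp(U theta)."""
    b = np.exp(U @ theta)
    m = n.sum() * b / b.sum()
    return U.T @ (n - m)

def hessian(theta, U, n):
    """-U^T (D_m - m m^T / N) U."""
    N = n.sum()
    b = np.exp(U @ theta)
    m = N * b / b.sum()
    return -U.T @ (np.diag(m) - np.outer(m, m) / N) @ U

# Toy 2 x 2 table, cells flattened in row-major order (illustrative data).
n = np.array([10.0, 4.0, 3.0, 8.0])
# Columns span the orthogonal complement of the constant vector:
# two main effects and their interaction (an assumed +/-1 contrast coding).
U = np.array([[ 1.0,  1.0,  1.0],
              [ 1.0, -1.0, -1.0],
              [-1.0,  1.0, -1.0],
              [-1.0, -1.0,  1.0]])
theta = np.array([0.2, -0.1, 0.3])

# Central finite differences as an independent check of the gradient.
eps = 1e-6
fd_grad = np.array([
    (loglik_kernel(theta + eps * e, U, n) - loglik_kernel(theta - eps * e, U, n)) / (2 * eps)
    for e in np.eye(3)
])
grad = gradient(theta, U, n)
H = hessian(theta, U, n)
```

Since the columns of `U` are orthogonal to the constant vector and the fitted means are strictly positive, the negative Hessian is positive definite here, so the log-likelihood is strictly concave in `theta`.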

2.1 Log-Linear Subspaces 

In this section we construct the log-linear subspaces we will be considering. Although log-linear
models are defined by generic linear manifolds of $\mathbb{R}^{\mathcal{I}}$, in practice it is customary to consider only
very specific classes of linear subspaces, which are also characteristic of ANOVA models and experimental
design. These subspaces present considerable advantages in terms of interpretability and
ease of computation and can be constructed easily by exploiting various correspondences between
the combinatorial structure of the power set of $\mathcal{K} = \{1, \ldots, K\}$ and a certain direct sum decomposition
of $\mathbb{R}^{\mathcal{I}}$, to be described below.

A rather intuitive way of specifying a certain dependence structure among the $K$ variables
of interest is to provide a list of the interactions among them. Then, the associated statistical
model is representable as a class of subsets of $\mathcal{K} = \{1, 2, \ldots, K\}$, each one indicating a different
type of interaction. In fact, every subset $h$ of $\mathcal{K}$ can be given a straightforward ANOVA-type
interpretation, based on its cardinality $|h|$, so that $h$ identifies an interaction of order $|h| - 1$ among
the variables $\{i : i \in h\}$. For example, if $|h| = 1$, then $h$ is a main effect, if $h = \emptyset$, then $h$ is the grand
mean, and so on.

Formally, let $2^{\mathcal{K}}$ be the power set of $\mathcal{K}$, which we view as a boolean lattice with respect to the
partial order induced by subset inclusion. An abstract simplicial complex $A$
on $\mathcal{K}$ is a class of subsets of $\mathcal{K}$ such that $h \subset d$ for some $d \in A$ implies $h \in A$. A simplicial complex
is uniquely determined by its elements that are maximal with respect to inclusion, known as its
facets, which represent the highest order interactions. Therefore, $A$ can be identified with the set
of its facets, a convention we will use throughout the article. By construction, once an interaction
term is part of the model, all lower order interactions are included, i.e. the model is hierarchical.
Then, a hierarchical log-linear model can be defined in a purely combinatorial form as a subset of $2^{\mathcal{K}}$.

Definition 2.1. A hierarchical log-linear model is a simplicial complex $A$ on $\mathcal{K}$.

This definition provides a formal justification of the traditional notation (see, e.g., Bishop et al.,
1975, Chapter 3) of identifying hierarchical log-linear models with classes of maximal subsets of
$\mathcal{K}$, sometimes denoted as generating classes, indicating the maximal order interactions. Notice also
that Definition 2.1 includes as a special case the class of graphical models, defined as follows. For
every complex $A$, one can construct its interaction graph, the graph with vertex set $\mathcal{K}$ and edge set
consisting of all unordered pairs $\{i, j\} \subset \mathcal{K}$ such that $\{i, j\} \subset d$ for some $d \in A$. A simplicial complex
$A$ is called graphical if its facets are the cliques of its interaction graph. From the probabilistic
point of view, lack of an edge between two nodes or sets of nodes is a formal representation of
various Markov properties of conditional independence among the corresponding variables (see,
e.g., Lauritzen, 1996).

Example 2.2 (Hierarchical log-linear models). $A = \{\{1\}, \{2\}, \{3\}\}$ is the model of mutual independence
of the three factors and $A = \{\{1, 2\}, \{2, 3\}\}$ denotes the model of conditional independence
of factors "1" and "3" given factor "2", a decomposable model. The simplest example
of a graphical non-decomposable (and non-reducible) model is the 4-cycle model on 4 factors,
$A = \{\{1, 2\}, \{2, 3\}, \{3, 4\}, \{1, 4\}\}$. The simplest non-graphical model is the model of no-3-factor
effect $A = \{\{1, 2\}, \{2, 3\}, \{1, 3\}\}$. In general, for a $K$-way table, the largest non-saturated hierarchical log-linear
model is the model of no-$K$-factor effect, represented by the simplicial complex on $K$ nodes whose
$K$ facets form the set of all possible distinct subsets of $\mathcal{K}$ with cardinality $K - 1$. ■
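The models in Example 2.2 can be checked mechanically. The sketch below is a brute-force illustration, practical only for a handful of factors; the function names are hypothetical. It expands a list of facets into the full simplicial complex, builds the interaction graph, and tests whether the facets coincide with the maximal cliques, which is how "cliques" is read here.

```python
from itertools import combinations

def faces(facets):
    """All subsets of the facets, i.e. the simplicial complex they generate."""
    out = set()
    for d in facets:
        d = tuple(sorted(d))
        for r in range(len(d) + 1):
            out.update(frozenset(c) for c in combinations(d, r))
    return out

def maximal_cliques(vertices, edges):
    """Brute-force maximal cliques of the graph (vertices, edges)."""
    def is_clique(s):
        return all(frozenset(p) in edges for p in combinations(s, 2))
    cliques = [frozenset(c)
               for r in range(1, len(vertices) + 1)
               for c in combinations(sorted(vertices), r)
               if is_clique(c)]
    return {c for c in cliques if not any(c < other for other in cliques)}

def is_graphical(facets, vertices):
    """True if the facets coincide with the maximal cliques of the interaction graph."""
    edges = {h for h in faces(facets) if len(h) == 2}
    return {frozenset(d) for d in facets} == maximal_cliques(vertices, edges)

independence = [{1}, {2}, {3}]
cond_indep = [{1, 2}, {2, 3}]
four_cycle = [{1, 2}, {2, 3}, {3, 4}, {1, 4}]
no_three_factor = [{1, 2}, {2, 3}, {1, 3}]
```

For the no-3-factor model the interaction graph is the complete graph on three nodes, whose only maximal clique is {1, 2, 3}; since that set is not a facet, the model is not graphical.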

There is a remarkable correspondence between the combinatorics of simplicial complexes and of
certain orthogonal subspaces of $\mathbb{R}^{\mathcal{I}}$. In fact, for a given simplicial complex $A$, a log-linear subspace
$\mathcal{M}_A$ can be constructed in a natural way as the direct sum of orthogonal subspaces indexed by
subsets of $\mathcal{K}$. Specifically,

\[ \mathcal{M}_A = \bigoplus_{\{h \subseteq d \,:\, d \in A\}} \mathcal{U}_h, \tag{4} \]

where $\mathcal{U}_h \perp \mathcal{U}_{h'}$ for $h, h' \subset \mathcal{K}$ with $h \neq h'$. Below, we summarize the main features of this construction
that are relevant to our problem. Notice that the resulting log-linear subspaces are precisely
the factor-interaction subspaces and the subspaces of interactions, as described by Darroch and Speed
(1983).

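For any subset $h \subset \mathcal{K}$ and cell index $i = (i_1, \ldots, i_K) \in \mathcal{I}$, let $i_h$ denote its coordinate projection
$\{i_k : k \in h\}$ onto $\bigotimes_{k \in h} \mathcal{I}_k$. Define the equivalence relation on $\mathcal{I}$ given by

\[ i \overset{h}{\sim} j \iff i_h = j_h \]

for all $i, j \in \mathcal{I}$ and associate to each $h \subset \mathcal{K}$ the subspace $\mathcal{W}_h \subset \mathbb{R}^{\mathcal{I}}$ consisting of all functions on $\mathcal{I}$
that depend on $i \in \mathcal{I}$ only through $i_h$, i.e.

\[ \mathcal{W}_h = \{ f \in \mathbb{R}^{\mathcal{I}} : f(i) = f(j) \text{ if } i \overset{h}{\sim} j \}. \tag{5} \]

Letting $\mathcal{W}^{\mathcal{K}} = \{ \mathcal{W}_h : h \in 2^{\mathcal{K}} \}$, where the subspaces are defined as in (5), the posets $\mathcal{W}^{\mathcal{K}}$ and $2^{\mathcal{K}}$ are
isomorphic lattices because

\[ \mathcal{W}_{h'} \subset \mathcal{W}_h \iff h' \subset h \]

for all $h, h' \in 2^{\mathcal{K}}$, with the $\hat{0}$ and $\hat{1}$ elements of $2^{\mathcal{K}}$ and $\mathcal{W}^{\mathcal{K}}$ being $\emptyset$ and $\mathcal{K}$, and $\mathcal{R}(1)$ and $\mathbb{R}^{\mathcal{I}}$,
respectively. The full extent of this correspondence is explained in the next theorem. For a proof
see Lauritzen (1996, Appendix B.2) and, for further details and alternative derivations, Rinaldo
(2006b, Section 2).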

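Theorem 2.3. For any $h, h_1, h_2 \in 2^{\mathcal{K}}$,

\[ h = h_1 \cup h_2 \Longrightarrow \mathcal{W}_h = \mathcal{W}_{h_1} + \mathcal{W}_{h_2}, \]

\[ h = h_1 \cap h_2 \Longrightarrow \mathcal{W}_h = \mathcal{W}_{h_1} \cap \mathcal{W}_{h_2}. \]

Also, letting

\[ \mathcal{U}_h = \mathcal{W}_h \cap \Big( \sum_{\{h' \in 2^{\mathcal{K}} \,:\, h' \subsetneq h\}} \mathcal{W}_{h'} \Big)^{\perp}, \]

then

1. for any $h, h' \in 2^{\mathcal{K}}$ with $h \neq h'$, the subspaces $\mathcal{U}_h$ and $\mathcal{U}_{h'}$ are orthogonal to each other;

2. for each $h \in 2^{\mathcal{K}}$,

\[ \mathcal{W}_h = \bigoplus_{\{h' \in 2^{\mathcal{K}} \,:\, h' \subseteq h\}} \mathcal{U}_{h'}. \]

In particular, for $h = \mathcal{K}$,

\[ \mathbb{R}^{\mathcal{I}} = \bigoplus_{h' \subseteq \mathcal{K}} \mathcal{U}_{h'}. \tag{6} \]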

Remark 

Throughout the document, we will be assuming that the elements of $2^{\mathcal{K}}$ are ordered in some pre-defined
way, and that any indexing by subsets of $\mathcal{K}$ is done accordingly.

Theorem 2.3 is of great practical value, as it provides the linear algebra tools needed to construct
the log-linear subspaces (4). Then, any hierarchical log-linear model can be equivalently specified either
combinatorially, using Definition 2.1, or by the vector subspace defined in (4). Furthermore, the
dimension of the log-linear subspace $\mathcal{M}_A$ and of all of its subspaces of interactions can also be
computed directly, according to the next statement.

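Proposition 2.4. Let $\mathcal{H}$ be a class of subsets of $\mathcal{K}$, and $\mathcal{M}_{\mathcal{H}} = \bigoplus_{h \in \mathcal{H}} \mathcal{U}_h$. Then

\[ \dim(\mathcal{M}_{\mathcal{H}}) = \sum_{h \in \mathcal{H}} \prod_{k \in h} (I_k - 1). \tag{7} \]

In particular, for any log-linear model $A$,

\[ \dim(\mathcal{M}_A) = \Big( \prod_{k=1}^{K} I_k \Big) - \sum_{\{h \in 2^{\mathcal{K}} \,:\, h \not\subseteq d,\ \forall d \in A\}} \prod_{k \in h} (I_k - 1), \]

with the convention that, for $h = \emptyset$, $\prod_{k \in h} (I_k - 1) = 1$.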
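The dimension count in Proposition 2.4 is easy to evaluate directly. The following sketch assumes a hypothetical 2 x 3 x 4 table and sums the block dimensions $\prod_{k \in h} (I_k - 1)$ over all faces of a complex given by its facets; the helper names are illustrative.

```python
from itertools import combinations
from math import prod

def all_faces(facets):
    """Every subset of every facet, including the empty set."""
    out = set()
    for d in facets:
        d = tuple(sorted(d))
        for r in range(len(d) + 1):
            out.update(frozenset(c) for c in combinations(d, r))
    return out

def block_dim(h, levels):
    """prod_{k in h} (I_k - 1); the empty product equals 1 by convention."""
    return prod(levels[k] - 1 for k in h)

def model_dim(facets, levels):
    """Dimension of the log-linear subspace: sum of block dimensions over all faces."""
    return sum(block_dim(h, levels) for h in all_faces(facets))

# Hypothetical 2 x 3 x 4 table; keys are the factor labels 1, 2, 3.
levels = {1: 2, 2: 3, 3: 4}
saturated = [{1, 2, 3}]
no_three_factor = [{1, 2}, {2, 3}, {1, 3}]
independence = [{1}, {2}, {3}]
```

For the saturated model the sum over all subsets recovers the number of cells, 24, in agreement with the decomposition (6); dropping the three-factor block removes 1 x 2 x 3 = 6 dimensions.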

The appendix contains the proof of Proposition 2.4, along with an algorithmically simple way
of generating the design matrix $U_h$ spanning the subspace $\mathcal{U}_h$, for each $h$. More generally, one may
assume the columns of each $U_h$ to be an orthonormal system, although the matrices constructed in
the appendix would work just as well in our results.

For matrices $U_1, \ldots, U_n$ with the same number of rows $r$ and number of columns $c_1, \ldots, c_n$,
respectively, we will denote the operation of adjoining them into one matrix of dimension $r \times \sum_k c_k$
with

\[ \bigoplus_{k=1}^{n} U_k = [U_1 \ldots U_n]. \]

Then, using this notation and with $U_h$ a full-rank matrix spanning $\mathcal{U}_h$, the columns of

\[ U_A = \bigoplus_{h \in 2^{\mathcal{K}},\ h \neq \emptyset,\ h \subseteq d,\ d \in A} U_h \tag{8} \]

span $\mathcal{M}_A \cap \mathcal{R}(1)^{\perp}$, and, therefore, $U_A$ is a full-rank design matrix for the log-linear model $A$.
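One common way to build block design matrices of this kind is through Kronecker products of sum-to-zero contrasts and vectors of ones. The sketch below uses that construction as an assumption; it is not necessarily the construction given in the appendix, and the function names are hypothetical. Cells are ordered in row-major fashion.

```python
import numpy as np
from itertools import combinations
from functools import reduce

def contrast(size):
    """size x (size - 1) matrix whose columns sum to zero (sum-to-zero coding)."""
    return np.vstack([np.eye(size - 1), -np.ones((1, size - 1))])

def block_design(h, levels):
    """Kronecker-product basis for the interaction subspace indexed by h."""
    factors = [contrast(I_k) if k in h else np.ones((I_k, 1))
               for k, I_k in enumerate(levels, start=1)]
    return reduce(np.kron, factors)

def design_matrix(facets, levels):
    """Adjoin the block matrices over all non-empty faces h of the complex."""
    faces = {frozenset(c) for d in facets
             for r in range(1, len(d) + 1)
             for c in combinations(sorted(d), r)}
    ordered = sorted(faces, key=lambda h: (len(h), sorted(h)))
    return np.hstack([block_design(h, levels) for h in ordered])

levels = (2, 3, 4)  # hypothetical 2 x 3 x 4 table
U_12 = block_design({1, 2}, levels)
U_3 = block_design({3}, levels)
U_A = design_matrix([{1, 2}, {2, 3}, {1, 3}], levels)
```

Blocks indexed by different subsets are orthogonal because at least one Kronecker factor pairs a contrast with a vector of ones. The columns within a block are not orthonormal under this coding; a QR step on each block would give an orthonormal system if one is wanted.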



3 The Group Lasso Estimator for Log-Linear Models 

Following the results in the previous section, the columns of the matrix

\[ U = \bigoplus_{h \in 2^{\mathcal{K}},\ h \neq \emptyset} U_h \]

span $\mathcal{R}(1)^{\perp}$, where

\[ \mathrm{rank}(U_h) = \dim(\mathcal{U}_h) = d_h \]

is given in (7). Accordingly, for any vector $\theta \in \mathbb{R}^{I-1}$, we can write

\[ \theta = \mathrm{vec}\{\theta_h,\ h \in 2^{\mathcal{K}},\ h \neq \emptyset\}, \]

where $\theta_h$ denotes the $d_h$-dimensional sub-vector of $\theta$ corresponding to the sub-matrix $U_h$. Then, using
(1), the log-likelihood function for the saturated $(I - 1)$-dimensional log-linear model becomes

\[ \ell(\theta) = \sum_{h,\, h \neq \emptyset} \langle U_h^\top n, \theta_h \rangle - N \log \Big\langle \exp\Big\{ \sum_{h,\, h \neq \emptyset} U_h \theta_h \Big\}, 1 \Big\rangle + \log N! - \sum_i \log n_i!, \qquad \theta \in \mathbb{R}^{I-1}. \tag{9} \]

Notice that the one-dimensional subspace $\mathcal{R}(1)$ corresponding to the empty set is not included,
because of the multinomial sampling restriction.

For any non-trivial (i.e. different from the uniform distribution) model $A$, with corresponding
log-linear subspace $\mathcal{M}_A$, let

\[ \mathcal{H} = \{ h : h \neq \emptyset,\ h \subseteq \text{some } d \in A \} \tag{10} \]

be the collection of sets representing all the interactions in $A$, or equivalently, the collection of
factor interaction subspaces of $\mathcal{M}_A$, so that

\[ \dim(\mathcal{M}_A) - 1 = \sum_{h \in \mathcal{H}} d_h = d_{\mathcal{H}}. \]

If $\mathcal{H}$ is not empty (i.e. $A$ is different from $\{\emptyset\}$), it is clear that the natural parameter space for $A$,
i.e. $\mathbb{R}^{d_{\mathcal{H}}}$, can be embedded as a linear subspace of $\mathbb{R}^{I-1}$ consisting of all vectors such that

\[ \begin{cases} \|\theta_h\| > 0, & h \in \mathcal{H}, \\ \|\theta_h\| = 0, & h \notin \mathcal{H}, \end{cases} \]

with $\|\cdot\|$ being any norm on $\mathbb{R}^{I-1}$. The log-likelihood function for this model is still given by
Equation (9), where the summations are now taken over the sets $h$ in the class $\mathcal{H}$.

Then, the model selection problem of recovering the underlying model $A$ from an observed
table $n$ can be cast as an estimation problem for the block components of $\theta \in \mathbb{R}^{I-1}$ that have
positive norms, based on the likelihood (9) of the saturated model. To this end, one is naturally led
to consider penalized maximum likelihood estimation procedures of the form

\[ \max_{\theta} \{ \ell(\theta) - \mathrm{pen}(\theta) \}, \tag{11} \]

where $\ell(\theta)$ is defined by (9) and $\mathrm{pen}(\theta)$ assigns a penalty to every block $\theta_h$ that is non-zero. Ideally,
the function pen should satisfy two requirements. First, it should act as a thresholding function by
either keeping or killing (i.e. setting to zero) each block $\theta_h$, $h \subset \mathcal{K}$, $h \neq \emptyset$. Secondly, it should be
reasonably well behaved (e.g. be convex) so that the problem (11) is computationally feasible.

Yuan and Lin (2006) propose the group lasso procedure for Gaussian models, based on a class of
convex penalty functions which are specifically designed to produce sparsity in the vector of estimated
coefficients at the block level. These penalty functions are obtained as compositions of the $\ell_1$ norm
over quadratic norms of the individual blocks. The group lasso penalty results from applying first
the quadratic penalty to individual blocks, to promote non-sparsity within each block, and then from applying the
$\ell_1$ norm to the resulting block norms, to promote block sparsity. The group lasso methodology
of Yuan and Lin (2006), originally developed for linear Gaussian models under ANOVA settings,
was further extended to logistic regression models by Meier et al. (2006) and to log-linear models
by Dahinden et al. (2006), which inspired our work. Specifically, the group lasso estimator for
log-linear models we consider arises as the solution of the concave optimization problem

\[ \mathrm{argmax}_{\theta \in \mathbb{R}^{I-1}} P_\Lambda(\theta), \tag{12} \]

where

\[ P_\Lambda(\theta) = \frac{1}{N} \ell(\theta) - \lambda \sum_{h \neq \emptyset} \lambda_h \|\theta_h\|_2, \]

with $\ell(\cdot)$ defined as in (9) and $\Lambda = \{\lambda, \{\lambda_h, h \neq \emptyset\}\}$ a set of given tuning parameters. The
parameter $\lambda$ controls the overall effect of the penalty and should be a function of the sample
size, while the block parameters $\lambda_h$ allow for specific penalties depending on the sizes of the
individual blocks. A reasonable choice for these tuning parameters is $\lambda_h = \sqrt{d_h}$, so that each
block of coefficients is penalized according to its dimension, with larger blocks penalized more
heavily.

1. Obtain the log-linear group lasso estimator,

   \[ \hat{\theta} = \mathrm{argmax}_{\theta \in \mathbb{R}^{I-1}} P_\Lambda(\theta). \]

2. Extract the set of non-zero blocks from $\hat{\theta}$,

   \[ \hat{\mathcal{H}} = \{ h : \|\hat{\theta}_h\|_2 > 0 \}. \]

3. Recover the hierarchical log-linear model from $\hat{\mathcal{H}}$,

   \[ \hat{A} = \{ d : d \text{ is a maximal element of } \hat{\mathcal{H}} \}. \]

Table 1: The group lasso model selection for hierarchical log-linear models.
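The penalized objective can be written down in a few lines. This is a minimal sketch, assuming the objective is the log-likelihood divided by $N$ minus the weighted sum of block norms, with default block weights equal to the square root of the block dimension; the toy table, the +/-1 coding, and the function name are illustrative.

```python
import numpy as np

def group_lasso_objective(theta_blocks, U_blocks, n, lam, block_weights=None):
    """loglik(theta) / N - lam * sum_h lam_h * ||theta_h||_2.

    The additive constant log N! - sum_i log n_i! is dropped, since it does
    not depend on theta. By default lam_h = sqrt(d_h).
    """
    N = n.sum()
    if block_weights is None:
        block_weights = [np.sqrt(len(t)) for t in theta_blocks]
    eta = sum(U_h @ t_h for U_h, t_h in zip(U_blocks, theta_blocks))
    loglik = n @ eta - N * np.log(np.exp(eta).sum())
    penalty = sum(w * np.linalg.norm(t) for w, t in zip(block_weights, theta_blocks))
    return float(loglik / N - lam * penalty)

# Toy 2 x 2 table with three one-dimensional blocks (assumed +/-1 contrast coding).
n = np.array([10.0, 4.0, 3.0, 8.0])
U_blocks = [np.array([[1.0], [1.0], [-1.0], [-1.0]]),
            np.array([[1.0], [-1.0], [1.0], [-1.0]]),
            np.array([[1.0], [-1.0], [-1.0], [1.0]])]
theta_blocks = [np.array([0.1]), np.array([0.0]), np.array([0.5])]
zero_blocks = [np.zeros(1) for _ in U_blocks]

unpenalized = group_lasso_objective(theta_blocks, U_blocks, n, lam=0.0)
penalized = group_lasso_objective(theta_blocks, U_blocks, n, lam=0.2)
at_zero = group_lasso_objective(zero_blocks, U_blocks, n, lam=0.2)
```

At the origin the model is the uniform distribution over the four cells, so the objective equals minus the logarithm of 4 regardless of the penalty level.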

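Lemma 3.1. The program (12) admits a unique optimizer $\hat{\theta} \in \mathbb{R}^{I-1}$ whose $h$-block component satisfies

\[ \begin{cases} \frac{1}{N} U_h^\top (n - \hat{m}) = \lambda \lambda_h \frac{\hat{\theta}_h}{\|\hat{\theta}_h\|_2}, & \text{if } \hat{\theta}_h \neq 0, \\ \frac{1}{N} \| U_h^\top (n - \hat{m}) \|_2 \le \lambda \lambda_h, & \text{if } \hat{\theta}_h = 0, \end{cases} \tag{13} \]

where $\hat{m} = \frac{N}{\langle \exp^{U\hat{\theta}}, 1 \rangle} \exp^{U\hat{\theta}} = \mathbb{E}_{\hat{\theta}} n$.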

Having obtained the group lasso estimator $\hat{\theta}$, the model selection step entails building an estimate
of the true model $A$ by extracting the blocks of $\hat{\theta}$ with positive norm and then building a simplicial
complex $\hat{A}$ as illustrated in Table 1. One may say that this procedure is effective at recovering the
underlying set of interactions if, with high probability, $\hat{A}$ is sufficiently close to $A$. We call this
property model selection consistency. Notice that, since we are only concerned with finding good
estimators of $A$, it is not required of $\hat{\theta}$ to satisfy any optimality criteria as an estimator of $\theta^0$, besides
the ones leading to model selection consistency. In fact, we will see in Section 4.4 that $\hat{\theta}$ is far from
being optimal.
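The extraction step described above amounts to two set operations. The sketch below assumes the block estimates are stored in a dictionary keyed by subsets of factors; the numerical values are made up for illustration, and the tolerance argument is an assumption for use with iterative solvers that return blocks that are only approximately zero.

```python
import numpy as np

def selected_blocks(theta_hat, tol=0.0):
    """Indices h whose estimated block has Euclidean norm greater than tol."""
    return {h for h, block in theta_hat.items() if np.linalg.norm(block) > tol}

def maximal_elements(blocks):
    """Sets in `blocks` that are not strictly contained in another set of `blocks`."""
    return {h for h in blocks if not any(h < other for other in blocks)}

# Hypothetical block estimates for a 3-way binary table (values are made up).
theta_hat = {
    frozenset({1}): np.array([0.4]),
    frozenset({2}): np.array([-0.2]),
    frozenset({3}): np.array([0.1]),
    frozenset({1, 2}): np.array([0.3]),
    frozenset({1, 3}): np.array([0.0]),
    frozenset({2, 3}): np.array([0.0]),
    frozenset({1, 2, 3}): np.array([0.0]),
}
H_hat = selected_blocks(theta_hat)
A_hat = maximal_elements(H_hat)
```

Here the recovered facets are {1, 2} and {3}. Because the model is read off from maximal elements, the resulting complex is hierarchical by construction, even if some lower-order block nested in a selected facet was itself estimated as zero.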

The main advantage of using the group lasso estimator for estimating $A$, rather than traditional
methods of model selection based on sequential testing of a potentially very large number
of competing models, is its computational ease. In fact, the methodology described in Table 1 only
involves determining a penalized maximum likelihood estimator of $\theta^0$ and thus requires solving only
one convex optimization problem.
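To make the single convex program concrete, here is a minimal sketch of one way to solve it numerically: proximal gradient ascent with block soft-thresholding. This algorithm is an assumption chosen for illustration and is not described in the text above. The toy 2 x 2 table, the +/-1 coding, the fixed iteration count, and the function name are likewise illustrative, and the block weights are set to the square root of the block dimension.

```python
import numpy as np

def group_lasso_prox_grad(U_blocks, n, lam, n_iter=2000):
    """Proximal-gradient ascent on loglik / N - lam * sum_h sqrt(d_h) * ||theta_h||_2."""
    N = n.sum()
    U = np.hstack(U_blocks)
    sizes = [U_h.shape[1] for U_h in U_blocks]
    ends = np.cumsum(sizes)
    slices = [slice(e - s, e) for s, e in zip(sizes, ends)]
    weights = [np.sqrt(s) for s in sizes]
    # Step size 1 / L, with L an upper bound on the Lipschitz constant of the gradient.
    step = 1.0 / np.linalg.norm(U, 2) ** 2
    theta = np.zeros(U.shape[1])
    for _ in range(n_iter):
        b = np.exp(U @ theta)
        m = N * b / b.sum()
        z = theta + step * U.T @ (n - m) / N      # gradient step on loglik / N
        for sl, w in zip(slices, weights):        # block soft-thresholding
            norm = np.linalg.norm(z[sl])
            shrink = max(0.0, 1.0 - step * lam * w / norm) if norm > 0 else 0.0
            theta[sl] = shrink * z[sl]
    return [theta[sl].copy() for sl in slices]

# Toy 2 x 2 table with three one-dimensional blocks (assumed +/-1 contrast coding).
n = np.array([10.0, 4.0, 3.0, 8.0])
U_blocks = [np.array([[1.0], [1.0], [-1.0], [-1.0]]),
            np.array([[1.0], [-1.0], [1.0], [-1.0]]),
            np.array([[1.0], [-1.0], [-1.0], [1.0]])]

estimate = group_lasso_prox_grad(U_blocks, n, lam=0.2)
all_zero = group_lasso_prox_grad(U_blocks, n, lam=0.5)
```

In this toy case the solution can be worked out by hand. The two main-effect blocks are set exactly to zero, while the interaction block solves tanh(t) = 0.44 - 0.2, which matches the stationarity condition of Lemma 3.1. With the larger penalty level every block is thresholded to zero.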

We conclude this section remarking that one could make different choices for the group penalty
function. In particular, following Tropp (2005), one could consider a penalty term which is built
using the $\ell_1$ norm over the $\ell_\infty$ norms of individual blocks. This particular choice would assign a
milder penalty for complexity than our choice, for the same set of tuning parameters $\Lambda$.

3.1 The Group Lasso Estimator as a Smoothed MLE

The proof of Lemma 3.1 shows that the group lasso penalty implies that the selected model $\hat{A}$
is one for which the maximum likelihood estimates exist. In fact, comparing the group lasso
estimator with the MLE for the model $\hat{A}$, one gets a better insight on how the group penalty works.
We assume here some familiarity with the theory of exponential families. See Brown (1986), or, for
results specific to log-linear models, Rinaldo (2006a).

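For convenience, and without loss of generality, we replace $P_\Lambda$ with $N P_\Lambda$ in (12). The condition
of optimality for $\hat{\theta}$ implies that, for each $h \in \hat{\mathcal{H}}$,

\[ \hat{\theta}_h^\top U_h^\top (n - \hat{m}) = N \lambda \lambda_h \|\hat{\theta}_h\|_2. \]

Let $\hat{A}$ be the simplicial complex derived from $\hat{\mathcal{H}}$, as described in Table 1, and $U_{\hat{A}}$ be defined as in (8).
Consider the vector

\[ \hat{\theta}_{GL} = \mathrm{vec}\{ \hat{\theta}_h,\ h \neq \emptyset,\ h \subseteq d,\ d \in \hat{A} \}. \]

At the optimum, the objective function becomes, disregarding an irrelevant additive constant,

\[ \begin{aligned} \langle n, U\hat{\theta} \rangle - N \log \langle \exp^{U\hat{\theta}}, 1 \rangle - \sum_{h \in \hat{\mathcal{H}}} N \lambda \lambda_h \|\hat{\theta}_h\|_2 &= \langle n, U\hat{\theta} \rangle - N \log \langle \exp^{U\hat{\theta}}, 1 \rangle - \sum_{h \in \hat{\mathcal{H}}} \hat{\theta}_h^\top U_h^\top (n - \hat{m}) \\ &= \langle \hat{m}, U\hat{\theta} \rangle - N \log \langle \exp^{U\hat{\theta}}, 1 \rangle \\ &= \langle U_{\hat{A}}^\top \hat{m}, \hat{\theta}_{GL} \rangle - N \log \langle \exp\{ U_{\hat{A}} \hat{\theta}_{GL} \}, 1 \rangle. \end{aligned} \tag{14} \]

On the other hand, the optimal value of the log-likelihood function under the model $\hat{A}$ is achieved
at the MLE $\hat{\theta}_{MLE}$, and is equal to

\[ \langle U_{\hat{A}}^\top n, \hat{\theta}_{MLE} \rangle - N \log \langle \exp\{ U_{\hat{A}} \hat{\theta}_{MLE} \}, 1 \rangle. \tag{15} \]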

Equation (15) elucidates a fundamental fact from the theory of extended exponential families,
namely that the MLE, $\hat{\theta}_{MLE}$, and the minimal sufficient statistics, $U_{\hat{A}}^\top n$, are in one-to-one correspondence
with each other, through the mean value parametrization, in the sense that the observed
minimal sufficient statistics are the expected value of the minimal sufficient statistics with respect
to the distribution identified by the MLE itself. In fact, the MLE is determined as the inverse of
the mean value parametrization evaluated at the sufficient statistics. Furthermore, the mean value
parametrization is, in fact, a homeomorphism between the natural parameter space and the cone
generated by the columns of $U_{\hat{A}}^\top$, called the marginal cone.

The clear similarity between Equations (14) and (15) reveals that the penalized estimator $\hat{\theta}_{GL}$
arises in a very similar fashion as the ordinary MLE, the crucial difference being that the mean value
homeomorphism is no longer evaluated at the minimal sufficient statistics but at the different point
$U_{\hat{A}}^\top \hat{m}$. Because, unlike the observed table $n$, the vector of fitted values $\hat{m}$ is strictly positive by
construction, the point $U_{\hat{A}}^\top \hat{m}$ belongs to the relative interior of the marginal cone and, thus, can
be seen as a smoothed version of the sufficient statistics $U_{\hat{A}}^\top n$. Geometrically, the penalty function
pulls the sufficient statistics away from the boundary of the marginal cone. This forced amount
of smoothness injected by regularization is controlled by the tuning parameters in $\Lambda$, and affects
the asymptotic properties of $\hat{\theta}$, which, according to our results in Section 4.4, is asymptotically
biased and inefficient. Nonetheless, the model estimator $\hat{A}$ is consistent.

4 Asymptotic Analysis

4.1 Introduction

We will provide an asymptotic analysis of the model selection procedure described in Table 1 by
studying the properties of the group lasso estimator.

We will consider a rather general double-asymptotic framework, in which we allow both the 
sample size and the complexity of the statistical model to grow simultaneously. In particular, we 
will be assuming a sequence of statistical experiments consisting of log-linear models over an in- 
creasingly large set of cell combinations, implied by both a growing number of categorical variables 
and a growing number of levels for the variables, and with increasing sample size. To formally rep- 
resent this sequence of experiments, we will introduce a "time" variable n, which serves merely as 
an index and is not necessarily a quantification of the rate of increase of the sample size. Intuitively, 
the larger the index n, the bigger the contingency table, the larger the sample size and the more 
complex the model selection problem. 

To be specific, at time $n$,

• a multinomial sample of size $N_n$ is available from the joint distribution of $K_n$ categorical
variables, each defined over a finite set $\mathcal{I}_{j_n} = \{1, \ldots, I_{j_n}\}$, $j_n = 1, \ldots, K_n$; the support of this
distribution is the set $\mathcal{I}_n = \bigotimes_{j_n} \mathcal{I}_{j_n}$ of all cell combinations, of cardinality $I_n = \prod_{j_n} I_{j_n}$;

• the true underlying distribution is defined by a hierarchical log-linear model $A_n$, as described
in Section 2.1: the observed cell counts come from an exponential family of distributions with
log-likelihood function (9) and true natural parameter $\theta^0_n \in \mathbb{R}^{I_n - 1}$ such that $\|\theta^0_{h_n}\|_2 > 0$ for
$h_n \in \mathcal{H}_n$ and $\|\theta^0_{h_n}\|_2 = 0$ for $h_n \notin \mathcal{H}_n$, with $\mathcal{H}_n$ defined as in (10);

• the vector of true parameters $\theta^0_n$ is estimated by solving the program (12) with tuning parameters
$\Lambda_n = \{\lambda_n, \{\lambda_{h_n}, h_n \neq \emptyset\}\}$;

• the group lasso estimate $\hat{\theta}_n$ is then used to estimate $A_n$ as described in Table 1, leading to the
selected model $\hat{A}_n$.

In the rest of the article we will use the notation $\{t_n\}_n$ to denote a sequence of vectors
such that $t_n \in \mathbb{R}^{d_n}$, for every $n$.

We remark that the true model at each "time point" n need not be related to the true models
at different values of n. The sequential setting we adopt is a convenient device for representing
very generally an asymptotic framework for log-linear model selection with a diverging number of
parameters; in fact, there are so many factors that may increase the complexity of a log-linear model
(e.g., number of variables, number of interactions in the model, number of levels for each variable)
that we found it convenient to just allow each of them to change at every n.

In our sequential setting, the probability spaces are allowed to change with n and, when we
speak of convergence in probability to a constant or of tightness with respect to the index n, we explicitly
refer to a sequence of different probability measures. Accordingly, we will use the stochastic
small and large order notation $o_{P_n}$ and $O_{P_n}$, respectively, with an index n for the probability measures.
This notation is well-defined: see Definition 7.11 and Lemma 7.12 in Schervish (1998).

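We will embed the true parameter $\theta^0_n$ in $\mathbb{R}^{d_{\mathcal{H}_n}}$ and denote it with $\theta^0_{\mathcal{H}_n}$. We will indicate with
$\{\pi^0_n\}_n$ the sequence of true probability vectors and with $\{m^0_n\}_n$ the sequence of mean vectors, with
$m^0_n = N_n \pi^0_n$, for each $n$. One may take note that the Fisher information matrix at $\theta^0_{\mathcal{H}_n}$ is

\[ F_n = U_{A_n}^\top \left( D_{\pi^0_n} - \pi^0_n (\pi^0_n)^\top \right) U_{A_n}, \]

with maximal and minimal eigenvalues denoted by $l^{\max}_n$ and $l^{\min}_n$, respectively. The negative Hessian
of the log-likelihood function is

\[ -\nabla^2 \ell(\theta^0_{\mathcal{H}_n}) = U_{A_n}^\top \left( D_{m^0_n} - \frac{m^0_n (m^0_n)^\top}{N_n} \right) U_{A_n} = N_n F_n, \]

which is the covariance matrix of the natural sufficient statistics $U_{A_n}^\top n_n$.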

In the remainder of the article we will study some of the asymptotic properties of the sequence
of lasso estimates $\{\hat{\theta}_n\}_n$ generated according to the previous scheme. Although framed in a more
general setting, our results carry over to the standard asymptotic framework in which only the sample
size $N_n$ and the set of penalty parameters $\Lambda_n$ change with $n$. In our analysis we will establish a series
of progressively stronger results, which, naturally, demand increasingly stronger assumptions. In
Section 4.2, we show that $\hat{\theta}_n$ is a norm-consistent estimator of $\theta^0_n$ and in Section 4.3 we prove the
stronger property of model selection consistency, i.e.

\[ \lim_n \mathbb{P}(\hat{A}_n = A_n) = 1. \tag{16} \]

Finally, in Section 4.4 we give a central limit theorem for $\hat{\theta}_n$.



4.2 Estimation consistency 

In this section we establish a rather weak, but nevertheless essential, consistency property for the
group lasso estimator. Suppose we knew the sequence of true models $\Delta_n$ and we estimated $\theta^0_{\mathcal{H}_n}$
with the group lasso estimate $\hat\theta_n$ by solving (12), where the parameter space is now $\mathbb{R}^{d_{\mathcal{H}_n}}$. We give
sufficient conditions to establish that $\hat\theta_n$ is an $\ell_2$-consistent sequence of estimators, in the sense that,
for each $\epsilon > 0$,

$\lim_n P(\|\hat\theta_n - \theta^0_{\mathcal{H}_n}\|_2 > \epsilon) = 0.$ (17)

Because we are taking norms over Euclidean spaces of arbitrarily large dimensions, and using the
chain of inequalities $\ell_\infty \le \ell_2 \le \ell_1$, the previous result implies $\ell_\infty$-consistency but not $\ell_1$-consistency.
For the same reasons, model selection consistency (16) does not follow from (17). In fact, according
to our method of proof, estimation consistency (17) is a necessary condition for model selection consistency,
for which we need additional conditions.
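To see why consistency in the Euclidean norm does not carry over to the $\ell_1$ norm when the dimension grows, consider a hypothetical error vector (not taken from the article) that spreads a total mass of one evenly over k coordinates. The sketch below computes the three norms for increasing k: the $\ell_\infty$ and $\ell_2$ norms vanish while the $\ell_1$ norm stays equal to one.

```python
def norms(x):
    """Return the (l_inf, l_2, l_1) norms of a vector given as a list."""
    l_inf = max(abs(v) for v in x)
    l_2 = sum(v * v for v in x) ** 0.5
    l_1 = sum(abs(v) for v in x)
    return l_inf, l_2, l_1


def flat_error(k):
    """Hypothetical estimation error spread evenly over k coordinates."""
    return [1.0 / k] * k


if __name__ == "__main__":
    for k in (10, 1000, 100000):
        l_inf, l_2, l_1 = norms(flat_error(k))
        print(k, l_inf, l_2, l_1)
```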

We show (17) by establishing a more refined consistency property for $\hat\theta_n$. Under the assumptions
of Theorem 4.1, the group lasso estimator enjoys the same convergence rate as the maximum likelihood
estimator for exponential families with diverging number of parameters found by Portnoy (1988).
Following a similar remark in Portnoy (1988), we point out that the conclusion of the theorem
holds only for individual sequences of true parameters $\{\theta^0_{\mathcal{H}_n}\}_n$ and may not be expected in general
to hold uniformly over subsets of $\mathbb{R}^{d_{\mathcal{H}_n}}$.

Theorem 4.1. Consider the settings described in Section 4.1. Assume

1. [NC.1] $d_{\mathcal{H}_n} = o(N_n)$;
2. [NC.2] the uan condition (19) is satisfied;
3. [NC.3] $l_n^{\max} / l_n^{\min} = O(1)$, where $l_n^{\min}$ and $l_n^{\max}$ are the minimal and maximal eigenvalues of the Fisher
information matrix, respectively;

4. [NC.4] A„ = O (J^ y ' . ). 
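Then,

$\|\hat\theta_n - \theta^0_{\mathcal{H}_n}\|_2 = O_{P^0}\left(\sqrt{d_{\mathcal{H}_n} / N_n}\right).$ (18)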

The proof of Theorem 4.1 relies on a more general result about sequences of regular statistical
experiments with diverging number of parameters, which is of independent relevance and is
described in the next subsection.
Remarks 

Theorem 4.1 should be compared with Theorem 2.1 in Portnoy (1988), establishing norm consistency
of the MLE for exponential families with an increasing number of parameters. In fact, the conclusion
of our theorem still holds if we replace [NC.2] with the condition expressed in Equation (2.4)
in Portnoy (1988, page 359), which constrains the rate of growth of the expectations of third order
moments over compact neighborhoods of $\theta^0_{\mathcal{H}_n}$ (see also Ghosal, 2000, for similar conditions). Both
our uan condition and Equation (2.4) in Portnoy (1988) are used to control the order of magnitude
of the remainder term in the local quadratic approximation of the log-likelihood function around
$\theta^0_{\mathcal{H}_n}$, uniformly over compact neighborhoods. This essentially guarantees that, with high probability
and for large enough n, the log-likelihood function behaves around $\theta^0_{\mathcal{H}_n}$ like a concave function.
We conjecture that condition (2.4) in Portnoy (1988, page 359) is milder than our uan condition,
but, as stated, it is only applicable to exponential families. Similarly, condition [NC.4] is needed to
guarantee that, as n grows, the penalty function would not disrupt this local quadratic approximation.
Therefore, by (18), the group lasso estimator will eventually lie, with high probability, within
the same neighborhoods of the true parameter sequence as the MLE.

Theorem 4.1 is also similar to Theorem 1 in Fan and Peng (2004), whose assumptions result
from a direct adaptation to the present settings of the classical regularity conditions for efficient
likelihood estimation involving uniform (with respect to n) boundedness of the third-order derivatives
of the density functions in neighborhoods of the true parameter vector (see, e.g., Lehmann and Casella,
1998, Theorem 6.5.1). We remark that these assumptions lead to a much weaker control over the
remainder term of the quadratic approximation, which must be compensated by requiring a faster
rate of growth of the sample size compared to the number of parameters, namely $d_{\mathcal{H}_n}^4 / N_n \to 0$.

The condition [NC.3] on the eigenvalues of the Fisher information matrix is also used in Portnoy
(1988, Theorem 2.1). Geometrically, it prevents the ellipsoids $t_n^\top I(\theta_n^0) t_n$ (see definition of $I(\cdot)$
below) from getting over-stretched along some directions, thus destroying concavity. Statistically,
it preserves identifiability of the true models, as $n \to \infty$.

4.2.1 A quadratic approximation result for regular models in the double asymptotic setting 

Consider a sequence of statistical models $\{(\mathcal{P}_n, \mathcal{X}_n, \mathcal{F}_n)\}_n$, where, for each n, $\mathcal{P}_n = \{P_{\theta_n}, \theta_n \in \Theta_n\}$
is a family of probability distributions over some Borel space $(\mathcal{X}_n, \mathcal{F}_n)$ that is dominated by some $\sigma$-finite
measure $\mu_n$ with densities $p_{\theta_n} = dP_{\theta_n} / d\mu_n$ and $\Theta_n \subset \mathbb{R}^{k_n}$. Suppose also that $k_n \uparrow \infty$ as $n \uparrow \infty$. We
further assume that, for each n, $(\mathcal{P}_n, \mathcal{X}_n, \mathcal{F}_n)$ is regular at $\theta_n^0$, that is (see, e.g., Bickel et al., 1993),
$\theta_n^0 \in \mathrm{int}(\Theta_n)$, $(\mathcal{P}_n, \mathcal{X}_n, \mathcal{F}_n)$ is quadratic mean differentiable at $\theta_n^0$ with quadratic mean derivative
$\eta_{\theta_n^0}$ and the Fisher information matrix

$I(\theta_n) = 4 \int \eta_{\theta_n} \eta_{\theta_n}^\top \, d\mu_n$

exists (see, e.g., Lehmann and Romano, 2005, Chapter 12), is non-singular and continuous at $\theta_n^0$.

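Associate to the sequence of models $(\mathcal{P}_n, \mathcal{X}_n, \mathcal{F}_n)$ a sequence of experiments consisting, for
each n, of an iid sample $(X_1, \ldots, X_{N_n})$ of size $N_n$ drawn according to a distribution indexed by an
unknown parameter $\theta_n^0 \in \mathrm{int}(\Theta_n)$, where $N_n \uparrow +\infty$ in such a way that $k_n / N_n \downarrow 0$. Let $\{\ell_n\}_n$ denote
the sequence of log-likelihood functions and let

$Z_n = \frac{2}{\sqrt{N_n}} \sum_{i=1}^{N_n} \frac{\eta_{\theta_n^0}(X_i)}{\sqrt{p_{\theta_n^0}(X_i)}}$

be the score function.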



Proposition 4.2. Consider the settings described above and assume further that, for all sequences
$\{t_n\}_n \in \mathbb{R}^{k_n}$ such that $t_n = \sqrt{k_n} x_n$, with $\|x_n\|_2 \le C$ for each n, the following uan condition is
satisfied

$\max_i |T_{in}| = o_{P^0}(1),$ (19)

where

$T_{in} = \sqrt{ p_{\theta_n^0 + t_n / \sqrt{N_n}}(X_i) / p_{\theta_n^0}(X_i) } - 1.$



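Then, for any finite C > 0,

$\ell_n\left(\theta_n^0 + t_n / \sqrt{N_n}\right) - \ell_n(\theta_n^0) = \langle t_n, Z_n \rangle - \frac{1}{2} t_n^\top I(\theta_n^0) t_n \left(1 + o_{P^0}(1)\right)$

as $n \to \infty$, uniformly over all sequences $\{t_n\}_n$ defined above. Furthermore, sufficient conditions for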
(19) are that "^y " ^ 0, with Z™'^^ the largest eigenvalue of I{6n), and that, for each sequence {tn}n 
such that lim„ ||tri||2 = 0, 



''"112 
kn 



(20) 



as $n \to \infty$.



Remarks 

The above result is a natural extension to our settings of the standard local asymptotic approximation
of the log-likelihood function for regular statistical models with fixed-size parameter space.
The intuition for considering sequences of parameters living within an $O(\sqrt{k_n})$ neighborhood comes
from simple high-dimensional geometry. In fact, for fixed-size parameter space, it is well known
that the local approximation holds uniformly for values of the local parameter within blown-up
versions, by a factor of $\sqrt{N_n}$, of sets of order $O(1)$. That is, as more data become available, the local
view gets finer and the quadratic approximation gets better, uniformly for values of the original parameter
over compact balls of radii shrinking as $1 / \sqrt{N_n}$. When the dimension of the parameter space
$k_n$ increases with n as described above, the positive effect of a larger sample over the quality of
the local approximation, measured by $\sqrt{N_n}$, is diminished by the increase in complexity, measured
by $\sqrt{k_n}$. To see this geometrically, notice that, in order for a sequence of closed balls in Euclidean
spaces of different dimensions to look proportionally the same, each ball must be a rescaled version
of the unit-volume ball in its ambient space. That is, the radii need to grow like $\sqrt{k_n}$. If no such
adjustment is made, any compact ball of fixed radius would very rapidly become minuscule as n
grows, merely due to the increase in dimensionality. Consequently, the local parameters are now
free to vary within balls that grow slower with $N_n$, at the rate $\sqrt{N_n / k_n}$.
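The geometric claim that unit-volume balls have radii of order $\sqrt{k}$ can be checked directly from the closed form for the volume of a Euclidean ball, $\pi^{k/2} / \Gamma(k/2 + 1)$. The sketch below is an illustration with hypothetical function names; it works on the log scale to avoid overflow and shows both that the fixed-radius unit ball has vanishing volume and that the unit-volume radius divided by $\sqrt{k}$ settles near $1/\sqrt{2\pi e}$, a standard consequence of Stirling's formula.

```python
import math


def log_unit_ball_volume(k):
    """Log-volume of the Euclidean unit ball in R^k: pi^(k/2) / Gamma(k/2 + 1)."""
    return 0.5 * k * math.log(math.pi) - math.lgamma(0.5 * k + 1.0)


def unit_volume_radius(k):
    """Radius of the Euclidean ball in R^k whose volume equals one."""
    return math.exp(-log_unit_ball_volume(k) / k)


if __name__ == "__main__":
    for k in (1, 2, 10, 100, 10000):
        r = unit_volume_radius(k)
        # Columns: dimension, volume of the radius-one ball, unit-volume radius, r / sqrt(k).
        print(k, math.exp(log_unit_ball_volume(k)), r, r / math.sqrt(k))
```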

Besides weaker requirements on the amount of smoothness, there is one more reason why
quadratic mean differentiability seems particularly convenient. In fact, for regular models, the
Hellinger distance induces on $\Theta_n$ approximately the same topology as the Euclidean distance, at
least locally, so that quadratic mean differentiability represents quite naturally the underlying geometric
changes induced by a simultaneous increase in dimensionality and sample size.

When the parameter space has fixed dimension, the condition (20) reduces to quadratic mean
differentiability and the uan condition (19) is always satisfied in regular models. However, with
increasing dimensions, the uan condition guarantees an error of order $o_{P^0}(k_n)$ and is a much
weaker requirement than (20), which can be shown to imply an error of order $o_{P^0}(1)$, just as
in the fixed dimensional case. In fact, assumption (20) is strong enough to eliminate the effect of
increasing dimensionality, by having the rate of increase in the sample size $N_n$ amplified by the rate
of increase of the parameter space, i.e. so that the norm of $t_n / \sqrt{N_n}$ is dimension independent.
This can be achieved if $k_n = o(N_n^{\alpha})$, for some $1/2 \le \alpha < 1$, the case $\alpha = 1/2$ being what is needed
for some of the central limit results we give in Section 4.4. Finally, it is worth pointing out that
quadratic mean differentiability at each n implies (20) only along subsequences $\{(\mathcal{P}_{n_j}, \mathcal{X}_{n_j}, \mathcal{F}_{n_j})\}_j$
of models.



4.3 Model Selection Consistency 

Having established norm consistency for the group lasso solution, we then proceed to derive sufficient
conditions for the stronger property of model selection consistency (16). Our method of
analysis is based on linearizing the sub-gradient optimality conditions (13) via a Taylor expansion
around the sequence of true parameters $\theta_n^0$. Norm consistency results turn out to be necessary to
guarantee enough stochastic control over the remainder term of that expansion. The conditions we
develop are quite similar in spirit to the ones arising from the study of sparse recovery of linear
signals under Gaussian or white noise using the lasso penalty (see, in particular, Wainwright, 2006;
Zhao and Yu, 2006).

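Recall the definition of $\mathcal{H}_n$ from (10) and let $\mathcal{H}_n^c = 2^{\mathcal{K}_n} \setminus (\mathcal{H}_n \cup \emptyset)$, so that $\|\theta^0_{h_n}\|_2 > 0$ for each
$h_n \in \mathcal{H}_n$ and $\|\theta^0_{w_n}\|_2 = 0$ for each $w_n \in \mathcal{H}_n^c$. Consider the sequence of events

$\mathcal{O}_n = \{\|\hat\theta_{h_n}\|_2 > 0, \forall h_n \in \mathcal{H}_n\} \cap \{\|\hat\theta_{w_n}\|_2 = 0, \forall w_n \in \mathcal{H}_n^c\}.$ (21)

Then, the model selection consistency property (16) of the group lasso solutions $\hat\theta_n$ is equivalent to
convergence in probability of $\mathcal{O}_n$, namely

$\lim_n P(\mathcal{O}_n) = 1.$ (22)

Define the two random vectors

$\eta_{\mathcal{H}_n} = \mathrm{vec}\left\{ \lambda_{h_n} \frac{\hat\theta_{h_n}}{\|\hat\theta_{h_n}\|_2}, h_n \in \mathcal{H}_n \right\}$

and

$\zeta_{\mathcal{H}_n^c} = \mathrm{vec}\left\{ \lambda_{w_n} z_{w_n}, w_n \in \mathcal{H}_n^c \right\},$ (23)

where $\|z_{w_n}\|_2 \le 1$.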

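The vector $\eta_{\mathcal{H}_n}$ is an explicit function of the group lasso estimator (it is not related to the
quadratic mean derivative $\eta_{\theta_n^0}$). Using the optimality conditions (13) and applying a Taylor expansion
of $\hat m_n$ around $m_n^0$, it can be verified that $\mathcal{O}_n$ holds if and only if both the equations

$\hat\theta_{\mathcal{H}_n} = \theta^0_{\mathcal{H}_n} + N_n \Sigma_{\mathcal{H}_n}^{-1} \left( \frac{1}{N_n} U_{\mathcal{H}_n}^\top (n_n - m_n^0) - \frac{1}{N_n} U_{\mathcal{H}_n}^\top R_n - \lambda_n \eta_{\mathcal{H}_n} \right)$ (24)

and

$\lambda_n \zeta_{\mathcal{H}_n^c} = \frac{1}{N_n} U_{\mathcal{H}_n^c}^\top (n_n - m_n^0) - \frac{1}{N_n} U_{\mathcal{H}_n^c}^\top R_n - W_n \left( \frac{1}{N_n} U_{\mathcal{H}_n}^\top (n_n - m_n^0) - \frac{1}{N_n} U_{\mathcal{H}_n}^\top R_n - \lambda_n \eta_{\mathcal{H}_n} \right)$ (25)

hold, where $\|R_n\|_2 = o(\|\hat\theta_{\mathcal{H}_n} - \theta^0_{\mathcal{H}_n}\|_2)$, and

$W_n = U_{\mathcal{H}_n^c}^\top \left( D_{m_n^0} - m_n^0 (m_n^0)^\top / N_n \right) U_{\mathcal{H}_n} \Sigma_{\mathcal{H}_n}^{-1}.$

We rely on Equations (24) and (25) to derive sufficient conditions guaranteeing that

$\lim_n P\left( \|\hat\theta_{h_n}\|_2 > 0, \forall h_n \in \mathcal{H}_n \right) = 1$ (26)

and

$\lim_n P\left( \max_{w_n \in \mathcal{H}_n^c} \|z_{w_n}\|_2 < 1 \right) = 1.$ (27)
In turn, (26) and (27) together imply that (22) holds.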
Remarks 

We point out that equation (27) in particular entails an asymptotic evaluation of the probability of
$\ell_2$-unit balls in high dimensional settings, whereas for the model selection consistency of the usual
lasso procedure it is enough to consider only the $\ell_\infty$-unit ball. This difference is rather substantial
and demands a much stronger control of the asymptotic behavior of the group lasso estimator
than is needed for the ordinary lasso estimator. One can see this geometrically, for example, by
noticing that, in $\mathbb{R}^k$, the volume of the $\ell_\infty$-unit ball (i.e. the unit cube $\mathrm{conv}\{-1, +1\}^k$) is $2^k$, while
the volume of the $\ell_2$-unit ball (i.e. the unit sphere) is roughly $(2\pi e / k)^{k/2}$, so that, for large k, the
unit sphere is just an extremely tiny fragment of the cube. A more refined geometric argument can
be made using Dvoretzky's Theorem (see, e.g., Pisier, 1999), according to which the $\ell_2$-unit ball
can be approximated arbitrarily well by almost spherical sections of the $\ell_\infty$-unit ball of dimension
about $\log k$. That is, when seen from inside the k-dimensional unit cube, the unit sphere looks
approximately like a slice of a sub-space of dimension $\log k$. Probabilistically, one can also observe
that for a k-dimensional vector $X = (X_1, \ldots, X_k)$ of i.i.d. random variables with sub-gaussian tails,
$E\|X\|_\infty = O(\sqrt{\log k})$, while $E\|X\|_2 = O(\sqrt{k})$, as $k \to \infty$, so that control over the $\ell_\infty$ unit ball can
be achieved by smaller sample sizes.
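Both the volume comparison and the comparison of expected norms are easy to illustrate numerically. The sketch below is only an illustration under stated assumptions: it takes the coordinates to be standard Gaussian (one particular sub-gaussian case), and the function names, replication count and seed are hypothetical choices.

```python
import math
import random


def log_volume_ratio(k):
    """log( vol(l_2 unit ball) / vol(l_inf unit ball) ) in R^k."""
    log_ball = 0.5 * k * math.log(math.pi) - math.lgamma(0.5 * k + 1.0)
    log_cube = k * math.log(2.0)
    return log_ball - log_cube


def mean_norms(k, reps=200, seed=0):
    """Monte Carlo estimates of E||X||_inf and E||X||_2 for X ~ N(0, I_k)."""
    rng = random.Random(seed)
    sum_inf = 0.0
    sum_two = 0.0
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(k)]
        sum_inf += max(abs(v) for v in x)
        sum_two += math.sqrt(sum(v * v for v in x))
    return sum_inf / reps, sum_two / reps


if __name__ == "__main__":
    for k in (5, 50, 500):
        print(k, log_volume_ratio(k))
    print(mean_norms(1000))
```

The log volume ratio becomes rapidly more negative with k, while the simulated sup-norm grows only like $\sqrt{\log k}$ against $\sqrt{k}$ for the Euclidean norm.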

The restrictions imposed by (27) are due to the choice of the quadratic norm for the penalty
function and cannot be avoided. By the same token, using an $\ell_\infty$ norm in the penalty function
would require dealing with $\ell_1$-unit balls, which are even smaller than $\ell_2$-unit balls and in general
harder to control. In this respect, the $\ell_1$ lasso penalty, which results in an $\ell_\infty$ norm condition for the
sub-gradient, appears to be optimal.

We will specialize the assumption [NC.3] about the eigenvalues of the Fisher information by
requiring that

[NC.3'] $0 < D_{\min} < l_n^{\min} \le l_n^{\max} < D_{\max} < \infty$.

Note that this condition does not preclude $\max_i \pi_i^0$ from decreasing to zero and, in fact, this
assumption will be used in Section 4.4.

Theorem 4.3. Assume the conditions of norm consistency [NC.1], [NC.2], [NC.3'] and [NC.4]. Then,
Equation (26) holds if

1. [MSC.1] letting $\alpha_n = \min_{h_n \in \mathcal{H}_n} \|\theta^0_{h_n}\|_2$,



(28) 



which, for \h„ = simplifies to ( + ~^ ^■ 

Equation (27) holds if 

1. [MSC.2] ('almost' parameter orthogonality) for each $w_n \in \mathcal{H}_n^c$



< 



(1-6) 



(29) 



2. [MSC.3] 



\7i^\ min„,„g-Hc X^^^ 



3. [MSC.4] 



\/4 



max 



which, for the choice $\lambda_{h_n} = \sqrt{d_{h_n}}$, requires $\lambda_n^2 N_n \to \infty$.
Remarks 

The condition [MSG. 2] implies that 



|W„| 



u. 



U 



^-1



< (1
which is the equivalent of the irreducibility condition used in Wainwright (2006) and Zhao and Yu
(2006), using the Euclidean norm. In fact, under the Gaussian homoscedastic ensemble settings,
the irreducibility condition on the design matrices is precisely an 'almost' parameter orthogonality
condition on the Fisher information matrix for the saturated model. The notable difference in our
[MSC.2] is the need to account for the number of blocks of parameters which are zero.

[MSC.3] is a sparsity condition that imposes bounds on the model complexity in terms of the
number of interactions that can be recovered.

Condition [MSC.4] indirectly provides some information on the rates of increase for the dimensions
for the subspaces of interactions not included in the true models, i.e. of $I_n$. In fact, setting
$\lambda_{h_n} = 1$ for all $h_n$ and n, and noting that



max dw„ 



n 



1), 



[MSC.4] requires



n 



0. 



{In 



Furthermore, since, 

dn-„ = ^ 

a more stringent sufficient condition implied by [MSC.4] (compare, e.g., the bound in Equation
(15) a) in Wainwright, 2006) is



dn,. 



0. 



For $\lambda_{h_n} = \sqrt{d_{h_n}}$, [MSC.4] reduces to $\lambda_n^2 N_n \to \infty$, so that the block penalties are of the exact order
of magnitude needed to eliminate the effect of the direct dependence of $I_n$ on $\lambda_{w_n}$ displayed above.

The proof of the Theorem shows that it is possible to make more refined assumptions, so that
[MSC.1] may be specialized as follows.

Addendum 4.4. Let $h_n^c = \mathcal{H}_n \setminus h_n$. Then Equation (26) holds if
1. [MSC.1']



UL D 



mm 



Dr 



mm 



< 



and 



which, for Aft„ 



— ( ^ max ^/dh^+Xn max Xh„ 



'dh^, simplifies to 

max/,„6^^„ 



0, 



(30) 



(31) 



Ctr, 



+ A„ 



0. 



Condition (30) requires the Fisher information matrix to behave approximately as a block
diagonal matrix, with the blocks given by the covariance matrices $\Sigma_{h_n}$, $h_n \in \mathcal{H}_n$. If this is the case,
then (26) is verified under (31), which is a much weaker requirement than (28).

4.4 A Central Limit Theorem for the Log-linear Group Lasso Estimator



Our final results concern the large sample properties of the distribution of the group lasso estimates
$\{\hat\theta_n\}_n$. In addition to the conditions guaranteeing both norm and model selection consistency,
we need to impose further restrictions guaranteeing some form of asymptotic normality under
our double asymptotic framework. The main rationale behind retaining the set of assumptions for
consistency is that they allow us to work only with the simpler and well-behaved sequence of events
$\mathcal{O}_n$ defined in (21), whose probability converges to one by (22).

In general, asymptotic normality under the double asymptotic settings obtains under stricter
assumptions than in standard (i.e. with fixed-dimensional parameter space) asymptotic problems.
Below, we provide a series of conditions, each providing a sense that, for large enough n,
the group lasso estimates (appropriately rescaled and translated) are close to a standard Normal
distribution. We point out that only the first two results in the theorem statement amount to a
full central limit statement, while the third result only offers a necessary condition for asymptotic
normality.

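To state our result, we need to introduce some notation. Let $J_{\mathcal{H}_n}$ be a $d_{\mathcal{H}_n} \times d_{\mathcal{H}_n}$ block-diagonal
matrix whose $h_n$-block is the $d_{h_n} \times d_{h_n}$ matrix

$\frac{\lambda_{h_n}}{\|\theta^0_{h_n}\|_2} \left( I_{h_n} - \frac{\theta^0_{h_n} (\theta^0_{h_n})^\top}{\|\theta^0_{h_n}\|_2^2} \right),$

with $h_n \in \mathcal{H}_n$ and with $I_{h_n}$ denoting the $d_{h_n}$-dimensional identity matrix. In the following, $C_n$ will
denote a $k \times d_{\mathcal{H}_n}$ matrix, where k is an arbitrary fixed number, such that

$\lim_n C_n C_n^\top = G$ (32)

for some $k \times k$ nonnegative definite and symmetric matrix G.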
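Each diagonal block of this matrix is the Hessian of the map $\theta_h \mapsto \lambda_h \|\theta_h\|_2$ at a non-zero block, namely $(\lambda_h / \|\theta_h\|_2)(I - \theta_h \theta_h^\top / \|\theta_h\|_2^2)$. The sketch below, with hypothetical function names and an arbitrary test point, compares this closed form with central finite differences of the gradient $\lambda_h \theta_h / \|\theta_h\|_2$.

```python
import math


def penalty_gradient(theta, lam):
    """Gradient of lam * ||theta||_2 at a non-zero block theta."""
    norm = math.sqrt(sum(t * t for t in theta))
    return [lam * t / norm for t in theta]


def penalty_hessian(theta, lam):
    """Closed form (lam / ||theta||) * (I - theta theta^T / ||theta||^2)."""
    norm = math.sqrt(sum(t * t for t in theta))
    d = len(theta)
    return [[lam / norm * ((1.0 if i == j else 0.0) - theta[i] * theta[j] / norm ** 2)
             for j in range(d)] for i in range(d)]


def numeric_hessian(theta, lam, step=1e-5):
    """Central finite differences of the gradient, one coordinate at a time."""
    d = len(theta)
    cols = []
    for j in range(d):
        up = list(theta)
        down = list(theta)
        up[j] += step
        down[j] -= step
        g_up = penalty_gradient(up, lam)
        g_down = penalty_gradient(down, lam)
        cols.append([(a - b) / (2.0 * step) for a, b in zip(g_up, g_down)])
    # cols[j][i] holds d grad_i / d theta_j; transpose into row-major form.
    return [[cols[j][i] for j in range(d)] for i in range(d)]


if __name__ == "__main__":
    theta, lam = [1.0, -2.0, 2.0], 1.5
    print(penalty_hessian(theta, lam))
    print(numeric_hessian(theta, lam))
```

One feature worth noting: the block annihilates its own $\theta_h$, so it is positive semidefinite with a one-dimensional null space rather than strictly positive definite.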

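Theorem 4.5. Assume the conditions for norm and model selection consistency and let

$X_n = \sqrt{N_n} \, F_{\mathcal{H}_n}^{-1/2} \left( (F_{\mathcal{H}_n} + \lambda_n J_{\mathcal{H}_n}) (\hat\theta_{\mathcal{H}_n} - \theta^0_{\mathcal{H}_n}) + \lambda_n \eta^0_{\mathcal{H}_n} \right).$ (33)

1. For each sequence $\{C_n\}$ of $k \times d_{\mathcal{H}_n}$ matrices satisfying (32), $C_n X_n$ converges weakly to the
$N_k(0, G)$ distribution if either the [CLT] condition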



or both the [CLT.Ma] condition 
and [CLT. Mb] condition 



dn„ = o(ivV2) 
dn^=o{Nn) 



maxvrj' = O 



1 



hold.
2. If the [CLT.BE] condition

$d_{\mathcal{H}_n} = o(N_n^{2/7})$

holds, then

$\sup_{A_n} |P(X_n \in A_n) - P(Z_n \in A_n)| \to 0,$

where $Z_n$ has a $N_{d_{\mathcal{H}_n}}(0, I_{d_{\mathcal{H}_n}})$ distribution, the supremum is taken over the convex sets $A_n$ in
$\mathbb{R}^{d_{\mathcal{H}_n}}$ and convergence occurs at the rate $O(d_{\mathcal{H}_n}^{7/4} N_n^{-1/2})$.

3. Let $\psi_n$ and $\varphi_n$ denote the characteristic functions of $X_n$ and $Z_n$, respectively. Then, for each
$\epsilon > 0$ and $T > 0$, there exists an $n^0(\epsilon, T)$ such that, for all $n > n^0(\epsilon, T)$,

$\sup\{ |\psi_n(t) - \varphi_n(t)| : \|t\|_2 \le T \} < \epsilon,$ (34)

if the [CLT.LF] condition 

holds. 
Remarks 

1. The theorem is mainly of theoretical interest, and it indicates that the group lasso estimate is
asymptotically biased and inefficient. In fact, Equation (33) demonstrates that the asymptotic
behavior of the group lasso estimator is affected by two terms. One is the bias term
$\lambda_n \eta^0_{\mathcal{H}_n}$, which depends on the gradient of the penalty function at the true parameter. The
other term, $J_{\mathcal{H}_n}$, is the Hessian at the true parameter of the penalty function, a positive
semidefinite matrix which inflates the inverse Fisher information. Both these terms are asymptotically
significant and indicate that the group lasso estimates may lack asymptotic optimality. Note
that this phenomenon is probably quite general. In fact, additional terms of this sort appear
also in a similar result in Fan and Peng (2004, Theorem 2).

2. Condition [CLT] is a rather weak one but nonetheless it guarantees, via a simple Lindeberg-Feller
argument, the asymptotic normality of a fixed number of linear combinations of the
coordinates of $\hat\theta_n$. In particular, it includes the case of

$C_n = [I_k \;\; O]$

where O is a $k \times (d_{\mathcal{H}_n} - k)$ matrix of zeros. For this particular choice, [CLT] guarantees the
marginal asymptotic normality of any fixed number of coordinates of $\hat\theta_n$.

3. Condition [CLT.BE] yields a full central limit type of result for the group lasso estimator and is
based on a multivariate Berry-Esseen type of bound found in Bentkus (2003). As is usual
with uniform results of this type, it is necessary to control the fluctuations of third order
moments and, consequently, to have a rather large sample size. To our knowledge, this is
the best rate available. See also Portnoy (1986) for a similar result requiring only a rate
$d_{\mathcal{H}_n} = o(N_n^{1/2})$, whose applicability and relevance to our problem is however unclear.







4. The last sets of conditions, [CLT.LF], and [CLT.Ma] and [CLT.Mb], establish control over the
second moments and are thus quite mild. They both lead to a double-asymptotic version of the
Lindeberg-Feller condition, with [CLT.Ma] and [CLT.Mb] stemming from a more elaborate
method of proof relying on a generalization of some results by Morris (1975). The Lindeberg-Feller
condition [CLT.LF] requires a slower rate of increase of $d_{\mathcal{H}_n}$ with respect to the sample
size $N_n$ than the other conditions based on [CLT.Ma] and [CLT.Mb], which, however, demand
some control over the speed at which the maximal cell probability tends to 0. However, they do
not produce a central limit type of result. They only guarantee that, for n large enough, the
characteristic function of an appropriate affine function of the group lasso estimate is very
close to the characteristic function of a standard Gaussian, within large compact balls.
Although it is well known that, for multidimensional problems, closeness of characteristic
functions by itself is not enough to guarantee closeness of multivariate distributions (see, e.g.,
Senatov, 1998), nonetheless, this result provides some sense that the group lasso estimate
may behave, for large n, like a Gaussian vector.

5. Instead of [CLT.Mb], one may want to enforce the more specialized assumption $\max_i \pi_i^0 \le C I_n^{-1}$,
for some positive constant C (see, e.g., Quine and Robinson, 1984, Theorem 1). Then,
in order for the Theorem to hold, one has to further assume that



which is compatible with the conditions for norm consistency. 

5 Conclusions 

Our problem and results differ from existing analyses of $\ell_1$-regularized least squares problems in at
least three aspects.

Firstly, unlike the case of regularized least squares or Gaussian error problems, the first order
optimality conditions for the group lasso program are non-linear in the parameters. As model
selection consistency hinges upon establishing appropriate bounds for the norms of the differences
between the blocks of true and estimated parameters, our strategy in Section 4.3 was to linearize
the sub-gradient equations via a first order Taylor expansion. This expansion, in turn, is valid
provided one has enough control over the remainder term, which we achieved by proving the
norm consistency property for the group lasso estimate. Thus, in our settings, norm consistency
is necessary for model selection consistency. In contrast, for quadratic problems, whose first order
conditions are linear in the parameters, norm consistency does not appear to be needed, although
it may still be important for central limit results, like in our case.

Secondly, the sub-gradient optimality conditions for the group penalty function are formulated
in terms of the $\ell_2$ norms of groups of parameters, and not in terms of the $\ell_\infty$ norm for the whole
parameter vector, which is the case for lasso-based procedures. As discussed in Section 4.3, this
difference is crucial and is the main reason why our convergence rates require a much larger
sample size than for the $\ell_1$ penalty.

Thirdly, in our problem we do not need to concern ourselves with random design. This is
a consequence of the contingency table settings and simplifies the analysis. More generally, as
we are working with exponential families of distributions, the Fisher information matrix is data-independent.
Consequently, unlike for example the case of Gaussian ensembles, for model selection
consistency it is sufficient to impose analytic, and not stochastic, conditions on the asymptotic
behavior of the Fisher information. These conditions (namely, the 'almost parameter orthogonality'
condition [MSC.2]) correspond to the various irreducibility conditions used in the lasso literature,
that we equivalently formulate in terms of the Fisher information.

Except for the irreducibility conditions, we did not take advantage of the exponential nature of
the multinomial distribution, so that the norm consistency and central limit results of Sections 4.2 and
4.4, respectively, rely on rather general properties that may hold for other families of distributions.



6 Proofs 

Proof of Lemma 3.1. The first order optimality condition for a vector $\theta \in \mathbb{R}^{d}$ is $0 \in \partial P_\lambda(\theta)$,
the sub-differential set of $P_\lambda(\theta)$. The gradient of $\ell(\theta)$ was already given in Equation (2). As for the
penalty term, which is not differentiable when some of the blocks are zero, we use the fact that if
$\|\cdot\|$ is a norm in a Banach space X, then the subdifferential at a point x is

$\partial \|x\| = \{x^* \in X^* : \|x^*\| \le 1\}$ if $x = 0$,
$\partial \|x\| = \{x^* \in X^* : \|x^*\| = 1, \langle x, x^* \rangle = \|x\|\}$ if $x \neq 0$,

where $X^*$ denotes the dual space of X. Then, since the dual space of $\ell_2$ is $\ell_2$, we conclude that
for any $\theta \in \mathbb{R}^{d}$ the subdifferential of $\sum_h \lambda_h \|\theta_h\|_2$ at $\theta$ is a subset of $\mathbb{R}^{d}$ comprised of
vectors whose h block component is

$\lambda_h \theta_h / \|\theta_h\|_2$ if $\theta_h \neq 0$,
$\{\lambda_h x : x \in \mathbb{R}^{d_h}, \|x\|_2 \le 1\}$ if $\theta_h = 0$. (35)

Equations (2) and (35) imply (13). As for uniqueness, the results in Rinaldo (2006a, Section 3)
show that the solution to the likelihood equations is always unique, unless the maximum likelihood
estimator does not exist, in which case for every sequence of parameters $\{\theta_n\}_n$ such that

$\lim_n \ell(\theta_n) = \sup_\theta \ell(\theta),$

$\|\theta_n\| \to \infty$. However, the penalty term would prevent this from happening. This, combined with
the strict convexity of the $\ell_2$ norm, guarantees uniqueness. ■
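The block structure of the subdifferential of the penalty $\sum_h \lambda_h \|\theta_h\|_2$ (a unique subgradient $\lambda_h \theta_h / \|\theta_h\|_2$ on non-zero blocks, any vector of Euclidean norm at most $\lambda_h$ on zero blocks) translates directly into a membership test. The following sketch is illustrative only; the function names, the tolerance and the example blocks are hypothetical.

```python
import math


def l2_norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def in_penalty_subdifferential(theta_blocks, g_blocks, lambdas, tol=1e-9):
    """Check whether g lies in the subdifferential of sum_h lambda_h * ||theta_h||_2.

    theta_blocks and g_blocks are lists of equally sized blocks (lists of floats);
    lambdas holds one non-negative weight per block.
    """
    for theta, g, lam in zip(theta_blocks, g_blocks, lambdas):
        norm = l2_norm(theta)
        if norm > 0.0:
            # Non-zero block: the subgradient is unique, lam * theta / ||theta||_2.
            expected = [lam * t / norm for t in theta]
            if any(abs(a - b) > tol for a, b in zip(g, expected)):
                return False
        elif l2_norm(g) > lam + tol:
            # Zero block: only vectors of norm at most lam are allowed.
            return False
    return True


if __name__ == "__main__":
    theta = [[3.0, 4.0], [0.0, 0.0]]
    weights = [2.0, 1.0]
    print(in_penalty_subdifferential(theta, [[1.2, 1.6], [0.6, -0.8]], weights))
    print(in_penalty_subdifferential(theta, [[1.2, 1.6], [1.0, 1.0]], weights))
```

In the example, the first candidate is accepted because its second block has norm one, while the second is rejected because its second block has norm $\sqrt{2} > 1$.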

Proof of Proposition 4.2. We follow the proof of Lehmann and Romano (2005, Theorem 12.3) (see
also Bickel et al., 1993, 509-513). Using a Taylor expansion of $\log(1 + x)$, we can write



where $r(T_{in}) \to 0$ as $T_{in} \to 0$. Because of the uan assumption (19),

$\max_i |r(T_{in})| = o_{P^0}(1).$ (36)

Next, by quadratic mean differentiability, for each $\{t_n\}_n$,



which implies that 



Nn 



Then, 



Next, because 



we have 



2 Jm{Xij 



where f R^^diin = and, since 



E 



Rn 



61" 



we arrive at 
Finally, noting that 

I'll 



and using (19), we conclude, by the weak law of large numbers, that 



Combining (36), (38), (40) and (41), we get 



30 I 



1 



which gives the desired result, since $t_n^\top I(\theta_n^0) t_n = O(k_n l_n^{\max})$.

To show that (20) implies (19), write for convenience $S_{in} = 2 \eta_{\theta_n^0}(X_i) / \sqrt{p_{\theta_n^0}(X_i)}$, and notice that $t_n^\top S_{in}$



is of order 0(ZJf^^/c,,) and that, under (20), 



1 



■/Nn 



Then, following Bickel et al. (1993, page 510), for each e > 0, 



imj>e) < p(|r,„-(t„-^,5,J|>f^+p(Kt„^,5,J|>f) 



1 +T 



t„ Si 



S'- 



> e 



which implies (19). 



Proof of Theorem 4.1. For ease of notation we write Opo and opo for Opo and Opo , respectively. 
Also, for simplicity, we multiply Pa„ by in (12). The conditions of proposition 4.2 are satisfied 
for the sequence of exponential families under considerations, with r/go = Vp^o i — , 9^ = and 

fen = dn„. Then, for any finite C > and for each sequence G 0„]R'^'"" with ||x.„||2 < C, 

the term ^„ (^9^^ + \[^x^ - ini9^J is equal to 



For the first term, we notice that 



— ^nU7^„ (nn -m„) - -— X„S7^„rE„(l +Opo(l)). 



2 iV„ 



Thus, by the Chebyshev and Cauchy-Schwarz inequalities,



TttT 



(42) 



Using the fact that the Fisher information matrix is positive definite for each n, 



l^xl^n„Xn{l+Opo{l))2 < -l^iV„||Xn||2C,x(l+Opo(l)) < -^Opo {dHjriXnWD (1+Opo(l)). 



(43) 



Combining (42) and (43), by choosing C large enough, for each e > 0, and all n large enough 

' + a/^x„1 - U9'J <o]>l-e. 



sup tn tl„ 



Nn

Next we consider the difference in the penalty terms between $\theta^0_{\mathcal{H}_n}$ and $\theta^0_{\mathcal{H}_n} + \sqrt{d_{\mathcal{H}_n} / N_n} \, x_n$. By a first
order Taylor expansion,



where $x^*$ lies between 0 and $x_n$. Using the Cauchy-Schwarz inequality, the absolute value of the last
quantity is bounded by

KV ^ndHnWXnh ^ h 



Under the above condition, we see that — ^7*^^112 = Opo ^^J, as required. 
Proof of Theorem 4.3. We will deal with Equations (26) and (27) separately.
Proof of Equation (26). It is enough to show



< an 



1, 



(44) 



where $\alpha_n = \min_{h_n \in \mathcal{H}_n} \|\theta^0_{h_n}\|_2$. In fact, the former condition implies that the norm of the $h_n$-block of the vector
inside the norm sign in the previous display is less than $\|\theta^0_{h_n}\|_2$, $\forall h_n \in \mathcal{H}_n$, which, by the triangle
inequality, will produce the desired result.
First we consider the term

The vector $U_{\mathcal{H}_n}^\top (n_n - m_n^0)$ has mean zero and covariance matrix $\Sigma_{\mathcal{H}_n}$. Furthermore, because of
[NC.3'], letting $\gamma^{\min}_{\mathcal{H}_n} = \lambda_{\min}(\Sigma_{\mathcal{H}_n})$, we have



„,mm 

7w„ 



> -Dmiri > for all n. 



(45) 



Combining these observations, and using the formula for the expected value of a quadratic form, 
we arrive at 



where $d_{\mathcal{H}_n} = \sum_{h_n \in \mathcal{H}_n} d_{h_n}$. Then, Chebyshev's inequality implies



Op 



Nn 



Next, using (45) for the operator norm of of we get the upper bound 



2 Djjim 

which, for Xh„ = ^/d^, simplifies to -p^A„i/d^. 



(46) 



(47)

Finally, the norm of
is no larger than



-==Vd^M\\en„-0'nj2) 



Op 



(48) 



because \\Unj2< 

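Using equations (46), (47) and (48), condition (44) is satisfied if [MSC.1] holds.
Proof of Addendum 4.4. Using the properties of the inverse of a block matrix, we can write

$\Sigma_{\mathcal{H}_n}^{-1} = \begin{bmatrix} A & B \\ B^\top & C \end{bmatrix}^{-1} = \begin{bmatrix} (A - B C^{-1} B^\top)^{-1} & A^{-1} B (B^\top A^{-1} B - C)^{-1} \\ (B^\top A^{-1} B - C)^{-1} B^\top A^{-1} & (C - B^\top A^{-1} B)^{-1} \end{bmatrix}.$

If

$\|A^{-1} B\|_2 < 1,$ (49)

then, because of positive definiteness, we obtain the bounds (see, e.g., Horn and Johnson, 1990,
for more details)

$\|(A - B C^{-1} B^\top)^{-1} x\|_2 \le \|A^{-1} x\|_2$ (50)

and

$\|A^{-1} B (B^\top A^{-1} B - C)^{-1} x\|_2 \le \|A^{-1} B\|_2 \, \|(B^\top A^{-1} B - C)^{-1} x\|_2 \le \|C^{-1} x\|_2.$ (51)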


For any /i„ G Hn and with /i^ = Hn\hn, letting A = ( D 



and C = U)[c (-Dm — ) U/^c , we may decompose TiUn accordingly. Then, (30) guarantees that 



(49) is satisfied, for any choice of $h_n$, and we obtain the bounds (50) and (51). One can proceed



recursively by picking any set s„ G /i^ and applying the same arguments to U;[c (-Dm 

so that we exhaust all sets in $\mathcal{H}_n$. In the end, we see that, using (30) repeatedly, the norm of any
block of the vector on the left hand side of (44) is bounded by



mm 



Nn 



so that Equation (26) is satisfied if 



D 



mm 



Nn 



-1 



-^UT (n - m) - -^VlRn - K^L 



Nn 



< Or 



- 1, 

(52) 

for each $h_n \in \mathcal{H}_n$.
Noting that assumption [NC.3'] also guarantees that the minimal eigenvalue of $\Sigma_{h_n}$ is bounded below
by $D_{\min} N_n$, the same arguments used in the proof of (44) show that (52) is true for each $h_n \in \mathcal{H}_n$
if both conditions in [MSC.1'] are verified.

Proof of Equation (27).



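In equation (25) write $\zeta_{\mathcal{H}_n^c} = \Lambda_{\mathcal{H}_n^c} z_{\mathcal{H}_n^c}$, where $\Lambda_{\mathcal{H}_n^c}$ is a $d_{\mathcal{H}_n^c}$-dimensional diagonal matrix whose
diagonal is $\mathrm{vec}\{1_{w_n} \lambda_{w_n}, w_n \in \mathcal{H}_n^c\}$, with $1_{w_n}$ denoting the $d_{w_n}$-dimensional vector with entries all
equal to 1. Then, letting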



(25) becomes 



1 



NnXn 



For any Wn ^'H'i, consider the corresponding block in the vector A^^ ^nVHn' i-^- the vector 

1 



A,, 



(53) 



Because of assumption MSC.2, the Euclidean norm of (53), for any choice of $w_n \in \mathcal{H}_n^c$, is bounded
by

(1 - e) Y.h„eHn ^hr. 



which, in turn, is smaller than



(1-6) 



I min^„„g7^c A^ 



Then, under MSC.3, (53) will eventually be less than $(1 - \epsilon)$, uniformly over $w_n \in \mathcal{H}_n^c$.
Next, for each block $w_n \in \mathcal{H}_n^c$, we study the vector



(54) 



After some algebra, the covariance matrix of the term inside the parenthesis can be shown to be 
the matrix 



1/2 



whose largest eigenvalue is smaller than Nnl^^^^- Therefore, by Chebyshev inequality, the term (54) 
is of order no bigger than 




Under condition [MSC.4], uniformly over $w_n \in \mathcal{H}_n^c$, the expression (54) is $o_{P^0}(1)$.

As for the remainder term, it is easy to see that it converges in probability to 0, so that (27) holds
true.
Proof of Theorem 4.5. All the claims in the proof are made inside the event $\mathcal{O}_n$. Because the norm consistency assumptions are in force, $\mathcal{O}_n$ occurs with probability tending to 1 and, therefore, our claims hold true within a set of probability converging to 1. In particular, \\Onn~^nJ\'^ ~ ( y ] (-'- + '^p(^))



Op [ \l • Reorganize equation (24) as 



By similar arguments used in the proof of Theorem 4.3, the term 



is of order 



and therefore converges in probability to 0. 

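As for $\Sigma_{\mathcal{H}_n}^{-1/2} N_n \lambda_n \eta_{\mathcal{H}_n}$, notice that, on $\mathcal{O}_n$, the vector $\eta_{\mathcal{H}_n}$ is a differentiable function of $\theta_{\mathcal{H}_n} \in \mathbb{R}^{d_{\mathcal{H}_n}}$. Then, using a Taylor expansion around $\theta^0_{\mathcal{H}_n}$,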

S-f iV„A„^H„ = ^nJ'NnXn (^vl^ + J?,„ {en^ - e'n,.) + op (^^^j ^ • (56) 
The remainder term in Equation (56) is of order 



which become negligible for Xn = O i i — j (obviously, Xn = O will do). Then using 

(55), we obtain 



S^f Jn„ - mO) = S^f ((S^,, + iV„A„ J - 0°) + iV„A„r?o, J + opo(l). (57) 
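Thus, we only need to consider the term $\Sigma_{\mathcal{H}_n}^{-1/2} U_{\mathcal{H}_n}^\top (n_n - m^0_n)$. For $1 \le j_n \le N_n$, let

$$Y_{j_n} = \Sigma_{\mathcal{H}_n}^{-1/2} U_{\mathcal{H}_n}^\top (X_{j_n} - \pi^0_n),$$

where the variables $X_{j_n}$ are iid Multinomials with size 1 and probability vector $\pi^0_n$. Then,

$$\Sigma_{\mathcal{H}_n}^{-1/2} U_{\mathcal{H}_n}^\top (n_n - m^0_n) = \sum_{j_n} Y_{j_n},$$

where $E Y_{j_n} = 0$, $\mathrm{Cov}\, Y_{j_n} = \frac{1}{N_n} I_{d_{\mathcal{H}_n}}$ and $\sum_{j_n} \mathrm{cov}(Y_{j_n}) = I_{d_{\mathcal{H}_n}}$.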

The result for part 1. under assumption [CLT] is based on standard arguments and entails
checking the multivariate Lindeberg-Feller conditions. We omit this proof because it is almost identical
to the proof we produce below for the [CLT.LF] conditions, the only difference being a rate
0{N~^) in equation (66). See also the proof of Theorem 2 in Fan and Peng (2004).
Part 2. of the theorem follows in a straightforward way from the main theorem in Bentkus
(2003) and the fact that E\\Fn'^^'^\j'^^{Xj^ - 7r°)||^ is of order O (d^^y by the same arguments 

used in the proof of part 3., given below. 

Next we prove the result of part 1. under both [CLT.Ma] and [CLT.Mb]. We relax the assumption
[CLT] by allowing the dimension of the parameter space to grow faster. To this end, we derive
multi-dimensional analogs of Lemmas 2.1 and 2.2 and Theorem 2.1 in Morris (1975). In particular,
our proof follows closely the proof of Morris (1975, Lemma 2.2). We first obtain the joint limit law by
using Lemma 9.1, and then establish the conditional limit law by using a multi-dimensional version
of condition (2.9) in Morris (1975). Note that the result in Steck (1957) about conditional limit
laws is actually a multi-dimensional one, but was formulated in Morris (1975, Th. 2.1)
as one-dimensional. The conditional law we are interested in is the distribution of $Z_n$, defined below
in (60).

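Let $\gamma_n = N_n^{-1} C_n \Sigma_{\mathcal{H}_n}^{-1/2} U_{\mathcal{H}_n}^\top m^0_n$, and set $A_n = C_n \Sigma_{\mathcal{H}_n}^{-1/2} U_{\mathcal{H}_n}^\top$. Note that $m^0_n = N_n \pi^0_n$, thus $\gamma_n = A_n \pi^0_n$. Denote the $i$-th column of $A_n$ by $a_i$, $i = 1, \ldots, I_n$. Then, the left hand side of (57), pre-multiplied by $C_n$, can be written as

$$Z_n = \sum_i f_i(n_i),$$

where $f_i(n_i) = (a_i - \gamma_n)(n_i - m^0_i)$. Let $\{X_i ;\ i = 1, \ldots, I_n\}$ be independent Poisson random variables with mean $m^0_i = N_n \pi^0_i$, so that $E f_i(X_i) = 0$ and $\sum_i \mathrm{cov}(f_i(X_i), X_i) = 0$, by construction. Next, define

$$V_n = N_n^{-1/2} \sum_i (X_i - m^0_i) \tag{58}$$

$$U_n = \Xi_n^{-1/2} \sum_i f_i(X_i), \tag{59}$$

where $\Xi_n = \sum_i \mathrm{cov}(f_i(X_i))$. A simple calculation shows that $\Xi_n = C_n C_n^\top$, a square matrix of fixed dimensions $k \times k$. The goal is to prove the asymptotic normality of $U_n$ given $\{V_n = 0\}$, and then use the fact (underlying Morris' method) that

$$\mathcal{L}(\Xi_n^{-1/2} Z_n) = \mathcal{L}(U_n \mid V_n = 0), \tag{60}$$

where $\mathcal{L}$ stands for law.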
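The fact underlying Morris' method is that independent Poisson counts, conditioned on their total, follow a multinomial law; conditioning on $\{V_n = 0\}$ fixes the total at $N_n$. The following sketch checks this identity numerically for one cell configuration. The probabilities, sample size and counts are hypothetical values chosen only for illustration.

```python
from math import exp, factorial, prod, isclose

def poisson_pmf(x, mean):
    return exp(-mean) * mean ** x / factorial(x)

def multinomial_pmf(counts, probs):
    n = sum(counts)
    coef = factorial(n)
    for c in counts:
        coef //= factorial(c)  # exact: each partial quotient is an integer
    return coef * prod(p ** c for p, c in zip(probs, counts))

def conditional_poisson_pmf(counts, means):
    """P(X = counts | sum(X) = sum(counts)) for independent Poisson X_i."""
    joint = prod(poisson_pmf(c, m) for c, m in zip(counts, means))
    return joint / poisson_pmf(sum(counts), sum(means))

probs = (0.2, 0.5, 0.3)                     # hypothetical cell probabilities
sample_size = 6                             # hypothetical N_n
means = tuple(sample_size * p for p in probs)  # Poisson means m_i = N_n * pi_i
counts = (1, 3, 2)                          # a table with total N_n

lhs = multinomial_pmf(counts, probs)
rhs = conditional_poisson_pmf(counts, means)
```

The two values `lhs` and `rhs` agree up to floating-point error, which is the one-table version of the equality in law used in (60).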

The random variables $V_n$ have zero means and unit variances. Furthermore, by the same arguments used in the early parts of Morris (1975, Lemma 2.2), assumption [CLT.Mb] guarantees that the uan condition is satisfied, so the sequence $V_n$ converges in distribution to a Gaussian variable. Similarly, the random vector $U_n$ satisfies $E U_n = 0$, $\mathrm{cov}(U_n) = I_k$, the identity matrix of dimensions $k \times k$, and, by construction, $\mathrm{cov}(V_n, U_n) = 0$. We argue below that $U_n$ satisfies the multi-dimensional Lindeberg condition. By Lemma 9.1, this will imply the asymptotic normality of the joint limit law of $(V_n, U_n)$.

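By the Schwarz inequality, for any $\epsilon > 0$,

$$\sum_i E\left[ \|f_i(X_i)\|_2^2 ;\ \|f_i(X_i)\|_2 > \epsilon \right] \le \sum_i \left[ E\|f_i(X_i)\|_2^4 \; P\left( \|f_i(X_i)\|_2 > \epsilon \right) \right]^{1/2}. \tag{61}$$

We will show that, for each $\epsilon > 0$, the right hand side of (61) tends to zero. Recall that $f_i(X_i) = (a_i - \gamma_n)(X_i - m^0_i)$. The length of $\gamma_n$ can be bounded as follows: $\|\gamma_n\|_2 \le \|C_n\|_2 \, \|\Sigma_{\mathcal{H}_n}^{-1/2} U_{\mathcal{H}_n}^\top \pi^0_n\|_2 \le O(1) D_{\min}^{-1/2} N_n^{-1/2} \|U_{\mathcal{H}_n}^\top \pi^0_n\|_2$. Elements of $U_{\mathcal{H}_n}^\top \pi^0_n$ are absolutely bounded by a constant $D_1$, thus $\|\gamma_n\|_2 \le D N_n^{-1/2} d_{\mathcal{H}_n}^{1/2}$. Similarly, $\|a_i\|_2 = \|A_n e_i\|_2 \le D N_n^{-1/2} d_{\mathcal{H}_n}^{1/2}$, where $e_i$ is the standard unit vector in $\mathbb{R}^{I_n}$ with $i$-th coordinate equal to 1. Adding up, $\|a_i - \gamma_n\|_2 = O\left( N_n^{-1/2} d_{\mathcal{H}_n}^{1/2} \right)$, which tends to zero by assumption [CLT.Ma].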

Next, we use the following large deviation result for Poisson random variables, due to Bobkov and Ledoux
(1998) and based on a modified logarithmic Sobolev inequality:

Theorem 6.1. Let $X$ be a Poisson random variable with parameter $\lambda$. Then, for every $h : \mathbb{N} \to \mathbb{R}$ with $\sup_{x \in \mathbb{N}} |h(x + 1) - h(x)| \le 1$,

$$P\left( h(X) - E h(X) \ge b \right) \le \exp\left( -\frac{b}{4} \log\left( 1 + \frac{b}{2\lambda} \right) \right) \tag{62}$$

for all $b > 0$.

Then, using Theorem 6.1, for some constant $D$,



Xi 



m? > ella,- 



■7n||2 



< exp 



' log (l + 



1 



2 m^lla,- 



< 



exp 



DN^J^d'J^log(l + eD 



7n||2 
1 



exp 



-o 



(63) 



as $n \to \infty$. The last inequality follows by condition CLT.Mb. The same result may be achieved by
applying a modified logarithmic Sobolev inequality to the left tail.
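As a sanity check on the Poisson tail bound of Bobkov and Ledoux used above, one can take $h(x) = x$ (which has unit increments) and compare the exact upper tail $P(X - \lambda \ge b)$ with $\exp\left(-\frac{b}{4}\log\left(1 + \frac{b}{2\lambda}\right)\right)$. The sketch below assumes the bound has exactly this form and constants; the function names and the grid of $(\lambda, b)$ values are illustrative choices, not taken from the paper.

```python
from math import exp, log

def poisson_upper_tail(lam, threshold, terms=400):
    """P(X >= threshold) for X ~ Poisson(lam), by direct summation of the pmf."""
    p = exp(-lam)            # P(X = 0)
    total = 0.0
    for x in range(terms):
        if x >= threshold:
            total += p
        p *= lam / (x + 1)   # update to P(X = x + 1)
    return total

def bobkov_ledoux_bound(lam, b):
    """Assumed form of the bound: exp(-(b / 4) * log(1 + b / (2 * lam)))."""
    return exp(-(b / 4.0) * log(1.0 + b / (2.0 * lam)))

# Compare the exact tail P(X - lam >= b) with the bound on a small grid.
comparisons = [
    (lam, b, poisson_upper_tail(lam, lam + b), bobkov_ledoux_bound(lam, b))
    for lam in (0.5, 2.0, 10.0)
    for b in (0.5, 1.0, 3.0, 8.0)
]
```

Each tuple in `comparisons` holds $(\lambda, b, \text{exact tail}, \text{bound})$; on this grid the exact tail sits below the bound, often by a wide margin, which is consistent with the bound being a general-purpose concentration inequality rather than a sharp estimate.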

Finally, $\sum_i \left( E\|f_i(X_i)\|_2^4 \right)^{1/2} = \sum_i \|a_i - \gamma_n\|_2^2 \left( m^0_i + 3 (m^0_i)^2 \right)^{1/2}$, which is of the order of magnitude
of $O(d_{\mathcal{H}_n})$. This, together with (61) and (63) and assumption CLT.Ma, shows that $U_n$ satisfies the
Lindeberg condition, as stated.
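The fourth-moment computation above rests on the identity $E(X - m)^4 = m + 3m^2$ for a Poisson variable $X$ with mean $m$. A minimal numerical check, with hypothetical values of the mean and a truncated series in place of the infinite sum:

```python
from math import exp

def poisson_central_moment(mean, order, terms=400):
    """E[(X - mean)^order] for X ~ Poisson(mean), by truncated summation."""
    p = exp(-mean)            # P(X = 0)
    total = 0.0
    for x in range(terms):
        total += (x - mean) ** order * p
        p *= mean / (x + 1)   # update to P(X = x + 1)
    return total

# Hypothetical means; each pair is (numerical moment, closed form m + 3 m^2).
checks = [(poisson_central_moment(m, 4), m + 3.0 * m ** 2) for m in (0.3, 1.0, 4.5)]
```

The same helper with `order=2` returns the variance, which for a Poisson variable equals the mean.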

We turn now to consider the conditional limit law. As mentioned above, Theorem 2.1 in Morris (1975) holds true also for multi-dimensional variables. We only need to replace condition (2.9) in Morris (1975) by a multi-dimensional version. Specifically, we show that

$$\limsup \, \sup \, E\left\| \sum_i \left[ f_i(L_i + M_i) - f_i(L_i) \right] \right\|_2^2 = 0, \tag{64}$$

where $L_n = (L_1, \ldots, L_{I_n})$ and $M_n = (M_1, \ldots, M_{I_n})$ are Multinomial random variables with probability vector $\pi^0_n$, and sample sizes $N_n + v_n N_n^{1/2}$ and $r N_n^{1/2}$, respectively, where the parameters $v_n = O(1)$ and $r$ are specified as in Morris (1975, Lemma 2.2). Notice that $f_i(L_i + M_i) - f_i(L_i) = (a_i - \gamma_n) M_i$. Thus,

$$\left\| \sum_i (a_i - \gamma_n) M_i \right\|_2^2 = \left\| A_n M_n - r N_n^{1/2} A_n \pi^0_n \right\|_2^2 = (M_n - E M_n)^\top B_n (M_n - E M_n),$$

where $E M_n = r N_n^{1/2} \pi^0_n$, and $B_n = A_n^\top A_n$. Taking expectation yields



Since 



Therefore, 



m°(m°)^ 



tr B„ Ding 



E 



J][/,(L, + M,)-/.(L.)] 



tr(C„CT) = 0(l) 



0(1) 



0, 



which shows that condition (64) holds, and the statement in part 1. is proved. 

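Finally, we prove part 3. of the theorem statement by showing that the Lindeberg-Feller conditions of Lemma 8.1 are satisfied. Under assumption [CLT.LF] we only need to show that

$$\sum_{j_n} E\left[ \|Y_{j_n}\|_2^2 \, 1\{ \|Y_{j_n}\|_2 > \epsilon \} \right] \to 0$$

as $n \to \infty$. Using the Cauchy-Schwarz inequality and the fact that the $Y_{j_n}$ are identically distributed, it is sufficient to show that

$$N_n \left( E\|Y_{j_n}\|_2^4 \; P(\|Y_{j_n}\|_2 > \epsilon) \right)^{1/2} \to 0. \tag{65}$$

By Chebyshev's inequality, for each $\epsilon > 0$,

$$P(\|Y_{j_n}\|_2 > \epsilon) \le \frac{E\|Y_{j_n}\|_2^2}{\epsilon^2} = O\left( \frac{d_{\mathcal{H}_n}}{N_n} \right). \tag{66}$$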



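Next, using the assumption on the minimal eigenvalue of $F_n$,

$$E\|Y_{j_n}\|_2^4 \le O\left( \frac{1}{N_n^2} \right) E\|U_{\mathcal{H}_n}^\top (X_{j_n} - \pi^0_n)\|_2^4 = O\left( \frac{d_{\mathcal{H}_n}^2}{N_n^2} \right),$$

where in the last step we use the fact that the entries of $U_{\mathcal{H}_n}^\top (X_{j_n} - \pi^0_n)$ are bounded, uniformly
over $n$.

Combining the last two displays, the left hand side of (65) is of order

$$N_n \, O\left( \left( \frac{d_{\mathcal{H}_n}^2}{N_n^2} \cdot \frac{d_{\mathcal{H}_n}}{N_n} \right)^{1/2} \right) = O\left( \frac{d_{\mathcal{H}_n}^{3/2}}{N_n^{1/2}} \right),$$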



which, in virtue of assumption [CLT.LF], vanishes, as desired.

7 Appendix A: Bases for $\mathcal{U}_h$



Given a log-linear model $\mathcal{H}$, bases for the subspaces $\mathcal{U}_h$, with $h \in \mathcal{H}$, will be defined and computed. The term contrast bases is appropriate because they indeed correspond to contrasts in models of analysis of variance. Using Birch's notation (see, in particular, Bishop et al., 1975), the design matrix for $\mathcal{U}_h$ will encode the u-terms corresponding to the $|h|$-order interactions among the factors in $h$.

For each term $h \subseteq \mathcal{K}$ and factor $k \in \mathcal{K}$, define the matrix

$$U_k^h = \begin{cases} Z_k & \text{if } k \in h \\ 1_k & \text{if } k \notin h, \end{cases}$$

where $Z_k$ is an $I_k \times (I_k - 1)$ matrix with entries

$$Z_k = \begin{pmatrix} 1 & & & \\ -1 & 1 & & \\ & -1 & \ddots & \\ & & \ddots & 1 \\ & & & -1 \end{pmatrix} \tag{67}$$

and $1_k$ is the $I_k$-dimensional column vector of 1's. Let

$$U_h = \bigotimes_{k=1}^{K} U_k^h. \tag{68}$$

Since the elements of $U_k^h$ are $-1$, $0$ and $1$, $U_h$ has entries that can only be $-1$, $0$ and $1$. Additional properties of the design matrices $U_h$ and hence of the subspaces spanned by their columns are given in the next Lemma. The results in Proposition 7 follow immediately.

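Lemma 7.1.

i. For every $h, h' \in 2^{\mathcal{K}}$, with $h \neq h'$, the columns of $U_h$ are linearly independent and $U_h^\top U_{h'} = 0$;

ii. $\mathbb{R}^{\mathcal{I}} = \bigoplus_{h \in 2^{\mathcal{K}}} \mathcal{R}(U_h)$;

iii. for any $h \in 2^{\mathcal{K}}$, $\mathcal{R}(U_h) = \mathcal{U}_h$, where $\mathcal{U}_h$ is the subspace of interactions for the factors in $h$.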
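The construction of the contrast bases and parts i. and ii. of the lemma can be illustrated on a small table. The sketch below builds each design matrix as a Kronecker product of a bidiagonal contrast matrix (for factors in $h$) or a column of ones (for factors not in $h$), then checks entries, orthogonality, column rank and the dimension count. The table dimensions and all function names are hypothetical choices made for this example.

```python
from fractions import Fraction
from itertools import combinations

def contrast(levels_k):
    """I_k x (I_k - 1) matrix whose j-th column is e_j - e_{j+1}."""
    return [[1 if r == c else -1 if r == c + 1 else 0 for c in range(levels_k - 1)]
            for r in range(levels_k)]

def ones(levels_k):
    return [[1] for _ in range(levels_k)]

def kron(X, Y):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in rx for y in ry] for rx in X for ry in Y]

def design(h, levels):
    """Kronecker product over factors k of contrast(I_k) if k in h, else ones(I_k)."""
    U = [[1]]
    for k, levels_k in enumerate(levels):
        U = kron(U, contrast(levels_k) if k in h else ones(levels_k))
    return U

def gram(X, Y):
    """X^T Y."""
    return [[sum(a * b for a, b in zip(cx, cy)) for cy in zip(*Y)] for cx in zip(*X)]

def rank(X):
    """Rank by exact Gaussian elimination over the rationals."""
    M = [list(map(Fraction, row)) for row in X]
    r = 0
    for c in range(len(M[0])):
        p = next((i for i in range(r, len(M)) if M[i][c] != 0), None)
        if p is None:
            continue
        M[r], M[p] = M[p], M[r]
        for i in range(len(M)):
            if i != r and M[i][c] != 0:
                f = M[i][c] / M[r][c]
                M[i] = [a - f * b for a, b in zip(M[i], M[r])]
        r += 1
    return r

levels = (2, 3, 2)   # hypothetical 2 x 3 x 2 table
K = len(levels)
subsets = [frozenset(s) for n in range(K + 1) for s in combinations(range(K), n)]
U = {h: design(h, levels) for h in subsets}
total_columns = sum(len(U[h][0]) for h in subsets)
```

For this table `total_columns` equals the number of cells, $2 \cdot 3 \cdot 2 = 12$, which is the dimension count behind part ii.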

Proof. Part i.: the first statement follows from the fact that the columns of $U_k^h$ are independent for each $k$ and $h$ and $U_h$ has dimension $\left( \prod_{k=1}^{K} I_k \right) \times \prod_{k \in h} (I_k - 1)$; for the second statement, without loss of generality, we can assume that there exists a factor $k$ such that $k \in h$ and $k \notin h'$. Then, $U_k^h = Z_k$ and $U_k^{h'} = 1_k$, so $(U_k^h)^\top U_k^{h'} = 0$, hence the result.

Part ii.: by i., the subspaces $\mathcal{R}(U_h)$ are orthogonal (hence the direct sum notation is well defined) and $\dim \mathcal{R}(U_h) = \prod_{j \in h} (I_j - 1)$. Therefore:

$$\dim \left( \bigoplus_{h \in 2^{\mathcal{K}}} \mathcal{R}(U_h) \right) = \sum_{h \in 2^{\mathcal{K}}} \prod_{j \in h} (I_j - 1) = \prod_{k \in \mathcal{K}} I_k,$$

where the last equality follows from the Möbius inversion formula (see, e.g., Lemma 3.13 in Rinaldo, 2006b) and the fact that, for $h = \emptyset$, $\dim(\mathcal{U}_h) = 1$.
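Part iii.: The matrix

$$S_h = \bigotimes_{k=1}^{K} S_k^h,$$

where

$$S_k^h = \begin{cases} \mathrm{Id}_{I_k} - \frac{1}{I_k} 1_k 1_k^\top & \text{if } k \in h \\ \frac{1}{I_k} 1_k 1_k^\top & \text{otherwise,} \end{cases}$$

with $\mathrm{Id}_{I_k}$ the $I_k \times I_k$ identity matrix, is the projector onto $\mathcal{U}_h$ (see Corollary 3.10 in Rinaldo, 2006b). Therefore it suffices to show $S_h U_h = U_h$ and $S_h U_{h'} = 0$ for $h \neq h'$. This implies $\mathcal{R}(U_h) \subseteq \mathcal{U}_h$, and the results follow from the fact that the inclusion cannot be strict because of the orthogonal decompositions of $\mathbb{R}^{\mathcal{I}}$ as in ii. and Equation (6). It is easy to see that $S_k^h U_k^h = U_k^h$ and, for any $h \neq h'$ with $k' \in h' \setminus h$, $S_{k'}^h U_{k'}^{h'} = 0$. Therefore,

$$S_h U_h = \bigotimes_{k=1}^{K} S_k^h U_k^h = U_h \quad \text{and} \quad S_h U_{h'} = \bigotimes_{k=1}^{K} S_k^h U_k^{h'} = 0$$

for any $h \neq h'$.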



8 Appendix B 

Lemma 8.1. Let $X_{j_n}$, $j_n = 1, \ldots, N_n$, be i.i.d. vectors in $\mathbb{R}^{k_n}$ with $E X_{j_n} = 0$ and $\sum_{j_n} \mathrm{cov}_n X_{j_n} = I_{k_n}$. Assume that $k_n \to \infty$ and $N_n \to \infty$ in such a way that

^^'^^J2^n\\XaP{\\xa2 > eVK} = 0. (69) 

Let $\psi_n$ be the characteristic function of $\sum_{j_n} X_{j_n}$ and $\phi_n$ the characteristic function of a $k_n$-dimensional standard Gaussian distribution. Then, for each $\epsilon > 0$ and $T > 0$, there exists an $n^0(\epsilon, T)$ such that, for all $n > n^0(\epsilon, T)$,

$$\sup\{ |\psi_n(t) - \phi_n(t)| : \|t\|_2 \le T \} < \epsilon. \tag{70}$$

Proof. The result follows from reduction to the one-dimensional case by the Cramér-Wold device. Consider an arbitrary sequence $\{t_n\} \in \bigotimes_n \mathbb{R}^{k_n}$. Let $W_{j_n} = t_n^\top X_{j_n}$ and set

$$s_n^2 = \sum_{j_n} \mathrm{Var}_n W_{j_n} = \|t_n\|_2^2,$$

so that $\frac{1}{s_n} \sum_{j_n} W_{j_n}$ has mean 0 and unit variance. Notice that the Lindeberg-Feller condition holds for the sequence of random variables $\frac{1}{s_n} \sum_{j_n} W_{j_n}$ since, for each $\epsilon > 0$,

hm^Y.^rrWll{\W,J > esn} = Hm I E„ (tlX,Vl{\tlX,J > ey^Wtnh} 

Jn 

^ -^Y.^n\\Xall{\\XjA2>e^n} = Q, (71) 
by (69), the Cauchy-Schwarz inequality and the fact

{\tlX,J > ev^||t„||2} C {||X,J|2 > ey^}. 

Notice that (71) does not depend on the sequence $\{t_n\}$ and the value at 1 of the characteristic function of $\frac{1}{s_n} \sum_{j_n} W_{j_n}$ converges to $\exp(-1/2)$, uniformly over all sequences $\{t_n\}$. From this, it follows that

$$\sup\{ |\psi_n(t_n) - \phi_n(t_n)| : \|t_n\| = 1 \} \to 0.$$

The previous uniform convergence holds also for sequences $\{t_n\}$ such that $\|t_n\|_2 = T$ for each $n$ and any $T > 0$, a fact that can be easily established once again from the Lindeberg-Feller conditions for the one dimensional case in the chain of inequalities leading to (71). Formally, for any $\epsilon > 0$ and $T > 0$ there exists an $n^0(\epsilon, T)$ such that for each $n > n^0(\epsilon, T)$, we have $\sup\{ |\psi_n(t) - \phi_n(t)| : \|t\| = T \} < \epsilon$. As $n^0(\epsilon, T)$ is non-decreasing in $T$ for fixed $\epsilon$, the proof of (70) is complete.



9 Appendix C 

The following Lemma is a multivariate analog of Lemma 2.1. in Morris (1975). 

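Lemma 9.1. Let $S_k = (S_{1k}, R_k) = \sum_{i=1}^{k} X_{ik}$, where $R_k = (S_{2k}, \ldots, S_{pk})$, $X_{ik} = (X_{i1k}, Y_{ik})$, and $Y_{ik} = (X_{i2k}, \ldots, X_{ipk})$. Suppose that $\{X_{ik}\}_{i=1}^{k}$ are independent random vectors, with $E X_{i1k} = 0$, $E Y_{ik} = 0$, and $\mathrm{Var}(S_k) = I_p$, the $p \times p$ identity matrix. Suppose $S_{1k}$ satisfies the uan condition, i.e., $\max_{1 \le i \le k} \mathrm{Var}\, X_{i1k} = o(1)$ as $k \to \infty$, and that $S_{1k} \to N(0, 1)$ in distribution. Finally, suppose that $R_k$ satisfies the (multi-dimensional) Lindeberg condition, i.e., for all $\epsilon > 0$,

$$\sum_{i=1}^{k} E\left[ \|Y_{ik}\|^2 ;\ \|Y_{ik}\|^2 > \epsilon \right] = o(1), \quad (k \to \infty).$$

Then $S_k \to N_p(0, I_p)$ in distribution.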

Proof. As in Morris' proof, $S_{1k}$ satisfies the (one-dimensional) Lindeberg condition, i.e.,

$$\sum_{i=1}^{k} E\left[ X_{i1k}^2 ;\ X_{i1k}^2 > \epsilon \right] = o(1), \quad (k \to \infty).$$

Therefore,

$$
\begin{aligned}
\sum_{i=1}^{k} E\left[ \|X_{ik}\|^2 ;\ \|X_{ik}\|^2 > \epsilon \right]
&= \sum_{i=1}^{k} E\left[ X_{i1k}^2 + \|Y_{ik}\|^2 ;\ X_{i1k}^2 + \|Y_{ik}\|^2 > \epsilon \right] \\
&\le 2 \sum_{i=1}^{k} E\left[ \max\{X_{i1k}^2, \|Y_{ik}\|^2\} ;\ \max\{X_{i1k}^2, \|Y_{ik}\|^2\} > \epsilon/2 \right] \\
&\le 2 \sum_{i=1}^{k} E\left[ X_{i1k}^2 ;\ X_{i1k}^2 > \epsilon/2 \right] + 2 \sum_{i=1}^{k} E\left[ \|Y_{ik}\|^2 ;\ \|Y_{ik}\|^2 > \epsilon/2 \right] = o(1).
\end{aligned}
$$

Thus, $S_k$ satisfies the (multi-dimensional) Lindeberg condition and the proof is complete (see, e.g.,
Bhattacharya and Rao, 1976, pp. 183-184). ■